Abstract

Belief Propagation (BP) is a widely used approximation for exact
probabilistic inference in graphical models, such as Markov Random
Fields (MRFs). In graphs with cycles, however, no exact convergence
guarantees for BP are known in general. For the case when all edges in
the MRF carry the same symmetric, doubly stochastic potential, recent
works have proposed to approximate BP by linearizing the update
equations around default values, which was shown to work well for the
problem of node classification. The present paper generalizes all prior
work and derives an approach that approximates loopy BP on any pairwise
MRF with the problem of solving a linear equation system. This approach
combines exact convergence guarantees and a fast matrix implementation
with the ability to model heterogeneous networks. Experiments on
synthetic graphs with planted edge potentials show that the
linearization achieves labeling accuracy comparable to BP for graphs
with weak potentials, while speeding up inference by orders of
magnitude.

1 Introduction 

Belief Propagation (BP) is an iterative message-passing algorithm for
performing inference in graphical models (GMs), such as Markov Random
Fields (MRFs). BP calculates the marginal distribution for each
unobserved node, conditional on any observed nodes (Pearl 1988). It
achieves this by propagating the information from a few observed nodes
throughout the network by iteratively passing information between
neighboring nodes. It is known that when the graphical model has a tree
structure, then BP converges to the true marginals (according to exact
probabilistic inference) after a finite number of iterations. In loopy
graphs, convergence to the correct marginals is not guaranteed; in
fact, convergence is not guaranteed at all, and using BP can lead to
well-documented convergence problems (Sen et al. 2008). While there is
a lot of research on convergence of BP (Elidan, McGraw, and Koller
2006; Ihler, Fisher III, and Willsky 2005; Mooij and Kappen 2007),
exact criteria for convergence are not known (Murphy 2012), and most
existing bounds for BP on general pairwise MRFs give only sufficient
convergence criteria, or are for restricted cases, such as when the
underlying distributions are Gaussians (Malioutov, Johnson, and Willsky
2006; Su and Wu 2015; Weiss and Freeman 2001).

This paper is a significantly extended version of a paper with the
same title presented at the 31st AAAI Conference on Artificial
Intelligence (AAAI-17). The present paper contains all proofs and
details on the experimental results. Possible future updates will be
made available on CoRR at http://arxiv.org/abs/1502.04956.

Semi-supervised node classification. BP is also a versatile formalism
for semi-supervised learning; i.e., assigning classes to unlabeled
nodes while maximizing the number of correctly labeled nodes (Koller
and Friedman 2009, ch. 4). The goal is to predict the most probable
class for each node in a network independently, which corresponds to
the Maximum Marginal (MM) assignment (Domke 2013; Weiss 2000). Let $P$
be a probability distribution over a set of random variables
$\mathbf{X} \cup \mathbf{Y}$. MM-inference (or “MM decoding”) searches
for the most probable assignment $y_i$ for each unlabeled node $Y_i$
independently, given evidence $\mathbf{X} = \mathbf{x}$:

$$\mathrm{MM}(\mathbf{y} \mid \mathbf{x}) = \{\arg\max_{y_i} P(Y_i = y_i \mid \mathbf{X} = \mathbf{x}) \mid Y_i \in \mathbf{Y}\}$$

Notice that this problem is simpler than finding the actual marginal
distribution. It is also different from finding the Maximum
A-Posteriori (MAP) assignment (the “most probable configuration”),
which is the mode or the most probable joint classification of all
non-evidence variables:^1

$$\mathrm{MAP}(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$$
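The difference between the two decodings can be seen on a toy example.
The following is a minimal sketch, assuming a hypothetical joint
distribution over two binary nodes (the numbers and function names are
illustrative and not taken from the paper): the MAP assignment picks
the single most probable joint configuration, whereas the MM assignment
picks the most probable class of each node from its own marginal.

```python
# Hypothetical joint distribution P(Y1, Y2 | X = x) over two binary nodes.
joint = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}


def map_assignment(joint):
    """Most probable joint configuration (the mode of the joint)."""
    return max(joint, key=joint.get)


def mm_assignment(joint, num_vars=2, num_classes=2):
    """Most probable class per variable, each taken from its own marginal."""
    labels = []
    for i in range(num_vars):
        marginal = [0.0] * num_classes
        for config, p in joint.items():
            marginal[config[i]] += p
        labels.append(max(range(num_classes), key=marginal.__getitem__))
    return tuple(labels)


map_labels = map_assignment(joint)   # joint mode
mm_labels = mm_assignment(joint)     # per-node maximum marginals
```

In this example the two decodings disagree on the first node: the joint
mode is (0, 0), while the marginal of the first node favors class 1.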

Convergent message-passing algorithms. There has been much research on
finding variations to the update equations of BP that guarantee
convergence. These algorithms are often similar in structure to the
non-convergent algorithms, yet it can be proven that the value of the
variational problem (or its dual) improves at each iteration (Hazan and
Shashua 2008; Heskes 2006; Meltzer, Globerson, and Weiss 2009). Another
body of recent papers has suggested solving the convergence problems of
MM-inference by linearizing the update equations. Krzakala et al. study
a form of linearization for unsupervised classification called
“spectral redemption” in the stochastic block model. That model is
unsupervised and has no obvious way to include supervision in its setup
(i.e., it is not clear how to leverage labeled nodes). Donoho, Maleki,
and Montanari propose “approximate message-passing” (AMP) as an
iterative thresholding algorithm for compressed sensing that is largely
inspired by BP. Koutra et al. linearize BP for the case of two classes
and propose “Fast Belief Propagation” (FaBP) as a method to propagate
existing knowledge of homophily or heterophily to unlabeled data. This
framework allows one to specify a homophily factor $h$ ($h > 0$ for
homophily or $h < 0$ for heterophily) and to then use this algorithm
with exact convergence criteria for binary classification. Gatterbauer
et al. derive a multivariate (“polytomous”) generalization of FaBP from
binary to multiple labels called “Linearized Belief Propagation”
(LinBP). Both aforementioned papers show considerable speed-ups for the
application of node classification and relational learning by
transforming the update equations of BP into an efficient matrix
formulation. However, those papers solve only special cases: FaBP is
restricted to two classes per node (de facto, one single score). LinBP
can handle multiple classes, but is restricted to one single node type,
one single edge type, and a potential that is symmetric and doubly
stochastic (see Fig. 1).^2

^1 See (Murphy 2012, ch. 5.2.1) for a detailed discussion on why MAP
has some undesirable properties and is not necessarily a
“representative” assignment. While in theory it is arguably preferable
to compute marginal probabilities, in practice researchers often use
MAP inference due to the availability of efficient discrete
optimization algorithms (Korc, Kolmogorov, and Lampert 2012).

Contributions. This paper derives a linearization of BP for arbitrary
pairwise MRFs, which transforms the parameters of an MRF into an
equation system that replaces multiplication with addition. In contrast
to standard BP, the derived update equations (i) come with exact
convergence guarantees, (ii) allow a closed-form solution, (iii) keep
the derived beliefs normalized at each step, and (iv) can thus be put
into an efficient linear algebra framework. We also show empirically
that this approach, in addition to its compelling computational
advantages, performs comparably to Loopy BP for a large part of the
parameter space. In contrast to prior work on linearizing BP, we remove
any restriction on the potentials and solve the most general case for
pairwise MRFs (see Fig. 1). Since it is known that any higher-order MRF
can be converted to a pairwise MRF (Wainwright and Jordan 2008,
Appendix E.3), the approach can also be used for higher-order
potentials. Our formalism can thus model arbitrary heterogeneous
networks, i.e., those that have directed edges or different types of
nodes.^3 This generalization is not obvious and required us to solve
several new algebraic problems: (i) Non-symmetric potentials modulate
messages differently across both directions of an edge; each direction
then requires different centering points (this is particularly
pronounced for non-quadratic potentials, i.e., when nodes adjacent to
an edge have different numbers of classes). (ii) Multiplying belief
vectors with non-stochastic potentials does not leave them stochastic;
an additional normalization would then not allow a closed-form matrix
formulation as before; we instead derive a “bias term” that remains
constant in the update equations and thus depends only on the network
structure and the potentials but not the beliefs. (iii) Dealing with
the full heterogeneous case (multiple potentials, multiple node types,
and different numbers of classes among nodes) requires a considerably
more general formulation. The technical report on arXiv (Gatterbauer
2015) contains the full derivation of the results presented in this
paper. An efficient Python implementation is available on GitHub (SSLH
2015).

^2 A potential is “doubly stochastic” if all rows and columns sum up to
1. As potentials can be scaled without changing the semantics of BP,
this definition also extends to any potential where the rows and
columns sum to the same value.

^3 Notice that an underlying directed network is still modeled as an
undirected Graphical Model (GM). For example, while the “friendship”
relation on Facebook is undirected, the “follower” relation on Twitter
is directed and has different implications on the two nodes adjacent to
a directed “links to”-edge. Yet, the resulting GM is still undirected,
but now has asymmetric potentials.

Figure 1: The approach proposed in this paper combines the full
expressiveness and generality of Loopy Belief Propagation (BP) on
pairwise MRFs with the computational advantages of Fast BP (Koutra et
al. 2011) and Linearized BP (Gatterbauer et al. 2015).

2 BP for pairwise MRFs 

An MRF is a factored representation of a joint distribution over
variables $\mathbf{X}$. The distribution is defined using a set of
factors $\{\phi_f \mid f \in F\}$, where each $f$ is associated with
the variables $\mathbf{X}_f \subseteq \mathbf{X}$, and $\phi_f$ is a
function from the set of possible assignments of $\mathbf{X}_f$ to
$\mathbb{R}^+$. The joint distribution is defined as
$P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z} \prod_{f \in F} \phi_f(\mathbf{x}_f)$,
where $Z$ is a normalization constant known as the partition function.

An important subclass of MRFs is that of pairwise MRFs, representing
distributions where all of the interactions between variables are
limited to those between pairs. More precisely, a pairwise MRF over a
graph is associated with a set of node potentials and a set of edge
potentials (Koller and Friedman 2009). The overall distribution is the
normalized product of all of the node and edge potentials.

We next focus on the mechanics of BP. Consider a network of $n$ nodes
where each node $s$ can be any of $k_s$ possible classes (or values). A
node $s$ maintains a $k_s$-dimensional belief vector where each element
$j$ represents a weight proportional to the belief that this node
belongs to class $j$. Let $x_s$ be the vector of prior beliefs (also
varyingly called local evidence or node potential) and $y_s$ the vector
of posterior (or implicit or final) beliefs at node $s$, and require
that $x_s$ and $y_s$ are normalized to 1; i.e.,
$\sum_{j \in [k_s]} x_s(j) = \sum_{j \in [k_s]} y_s(j) = 1$. For
example, a labeled node $s$ of class $i$ is represented by
$x_s(j) = 1$ for $j = i$ and $x_s(j) = 0$ for $j \neq i$. Using
$m_{us}$ for the $k_s$-dimensional message that node $u$ sends to node
$s$, we can write the BP update equations (Murphy 2012; Weiss 2000) for
the belief vector of a node $s$ as:

$$y_s(j) \leftarrow \frac{1}{Z_s}\, x_s(j) \prod_{u \in N(s)} m_{us}(j) \qquad (1)$$



Here, we write $Z_s$ for a normalizer that makes the elements of $y_s$
sum up to 1. Thus, the posterior belief $y_s(j)$ is computed by
multiplying the prior belief $x_s(j)$ with the incoming messages
$m_{us}(j)$ from all neighbors $u \in N(s)$, and then normalizing so
that the beliefs in all $k_s$ classes sum to 1. In parallel, each node
sends messages to each of its neighbors:

$$m_{st}(i) \leftarrow \frac{1}{Z_{st}} \sum_{j} \psi_{st}(j, i)\, x_s(j) \prod_{u \in N(s) \setminus t} m_{us}(j) \qquad (2)$$

Here, $\psi_{st}(j, i)$ is a proportional “coupling weight” (or
“compatibility,” “affinity,” “modulation”) that indicates the relative
influence of class $j$ of node $s$ on class $i$ of node $t$. Thus, the
message $m_{st}(i)$ is computed by multiplying together all incoming
messages at node $s$, except the one sent by the recipient $t$, and
then passing the result through the edge potential $\psi_{st}$. Notice
that we use $Z_{st}$ in Eq. (2) as a normalizer that makes the elements
of $m_{st}$ sum up to $k_t$ at each iteration. As pointed out by
Murphy, Weiss, and Jordan and by Pearl, normalizing the messages has no
effect on the final beliefs; however, this intermediate normalization
of messages will become crucial in our derivations. BP then repeatedly
computes the above update equations for each node until the values
(hopefully) converge. At iteration $r$ of the algorithm, $y_s(j)$
represents the posterior belief of $j$ conditioned on the evidence that
is $r$ steps away in the network.
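The update equations (1) and (2) can be sketched in a few lines of
NumPy. This is a minimal illustration under stated assumptions, not the
paper's implementation: the function name `run_bp`, the dictionary
layout, and the example potentials are hypothetical, messages are
updated synchronously, and each undirected edge is stored once per
direction with the transposed potential. The example graph is a chain
(a tree), where BP is known to return the exact marginals.

```python
import numpy as np


def run_bp(priors, potentials, num_iters=20):
    """Synchronous BP following Eqs. (1) and (2).

    priors:     {node s: prior vector x_s that sums to 1}
    potentials: {(s, t): matrix psi_st of shape (k_s, k_t)}, both directions
    """
    neighbors = {s: [u for (u, t) in potentials if t == s] for s in priors}
    # Messages start uniform; each m_st has k_t entries that sum to k_t.
    msgs = {(s, t): np.ones(len(priors[t])) for (s, t) in potentials}
    for _ in range(num_iters):
        new_msgs = {}
        for (s, t), psi in potentials.items():
            # Prior times all incoming messages except the one from t.
            prod = priors[s].copy()
            for u in neighbors[s]:
                if u != t:
                    prod = prod * msgs[(u, s)]
            m = psi.T @ prod                         # sum_j psi_st(j, i) prod(j)
            new_msgs[(s, t)] = m * len(m) / m.sum()  # entries sum to k_t
        msgs = new_msgs
    beliefs = {}
    for s in priors:
        b = priors[s].copy()
        for u in neighbors[s]:
            b = b * msgs[(u, s)]
        beliefs[s] = b / b.sum()                     # Eq. (1), normalizer Z_s
    return beliefs


# Hypothetical chain 0 - 1 - 2 with two classes per node.
psi01 = np.array([[2.0, 1.0], [1.0, 2.0]])
psi12 = np.array([[1.0, 3.0], [2.0, 1.0]])           # non-symmetric potential
priors = {0: np.array([0.9, 0.1]),
          1: np.array([0.5, 0.5]),
          2: np.array([0.5, 0.5])}
potentials = {(0, 1): psi01, (1, 0): psi01.T,
              (1, 2): psi12, (2, 1): psi12.T}
beliefs = run_bp(priors, potentials)
```

On graphs with cycles the same loop may oscillate or fail to settle,
which is the behavior the remainder of the paper addresses.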

3 Linearizing BP over any pairwise MRF 

This section gives a closed-form description for the final beliefs
after convergence of BP in arbitrary pairwise MRFs under a certain
limit consideration of all parameters. This is a strict and non-trivial
generalization of recent works (Fig. 1). The difficulty of our
generalization lies in technical details: non-symmetric potentials
require different centering points for messages across different
directions of an edge; non-stochastic potentials require different
normalizers for different iterations (and for different potentials in
the network), which does not easily lead to a simple matrix
formulation; and the full heterogeneous case (e.g., different numbers
of classes $k$ for different nodes) requires a considerably more
general derivation and final formulation.

Our approach is conceptually simple: we center all matrix entries
around well-chosen default values and then focus only on the deviations
from these defaults using Maclaurin series at several steps in our
derivation. The resulting equations replace multiplication with
addition and can thus be put into the framework of matrix-vector
multiplication, which can leverage existing highly-optimized code. It
also allows us to give exact convergence criteria for the resulting
update equations and a closed-form solution (that would require the
inversion of a large matrix). The approach is similar in spirit to the
idea of writing any MRF (with strictly positive density) as a
log-linear model. However, by starting from the update equations for
loopy BP, we solve the intractability problem by ignoring all
dependencies between messages that have traveled over a path of length
2 or more.

Definition 1 (Centering). We call a vector $x$ or matrix $X$ “centered
around $c$ with standard deviation $v$” if the average entry
$\mu(x) = c$ and standard deviation $\sigma(x) = v$.


Definition 2 (Residual vector/matrix). If a vector $x$ is centered
around $c$, then the “residual vector” $\hat{x}$ around $c$ is defined
as $\hat{x} = [x_1 - c, x_2 - c, \ldots]^\top$. Accordingly, we denote
a matrix $\hat{X}$ as a “residual matrix” if each entry is the residual
after centering around $c$.

For example, the vector $x = [1.1, 1.2, 0.7]^\top$ is centered around
$c = 1$, and the residuals from 1 form the residual vector
$\hat{x} = [0.1, 0.2, -0.3]^\top$; i.e., $x = 1_3 + \hat{x}$, where
$1_3$ is the 3-dimensional vector with all entries equal to 1. By
definition of a normalized vector, beliefs for any node $s$ are
centered around $\frac{1}{k_s}$, and the residuals for prior beliefs
have non-zero elements (i.e., $\hat{x}_s \neq 0_{k_s}$) only for nodes
with local evidence (nodes “with explicit beliefs”). Further notice
that the entries in a residual vector or matrix always sum up to 0
(i.e., $\sum_i \hat{x}(i) = 0$). This is done by construction and will
become important in the derivations of our results.
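The centering example can be reproduced numerically. The following is a
small sketch using only the Python standard library; the helper name
`residual` and the belief vector `[0.5, 0.3, 0.2]` are illustrative
assumptions, not taken from the paper.

```python
from statistics import fmean


def residual(vec, center):
    """Residual vector: deviation of each entry from the centering value."""
    return [v - center for v in vec]


x = [1.1, 1.2, 0.7]
c = fmean(x)                   # average entry of x (1 up to rounding)
x_hat = residual(x, 1.0)       # approximately [0.1, 0.2, -0.3]

# A normalized belief vector with k entries is centered around 1/k.
belief = [0.5, 0.3, 0.2]
belief_hat = residual(belief, 1.0 / len(belief))
```

Both residual vectors sum to zero up to floating-point rounding, which
is the property the derivation relies on.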

The main idea of our derivation relies then on the fol¬ 
lowing observation: if we start with messages and potentials 
with rows and columns centered around 1 with small enough 
standard deviations, then the normalizer of the update equa¬ 
tion Eq. (2) is independent of the beliefs and remains con¬ 
stant as Zgt = kf^. Importantly, the resulting equations 
do not require further normalization. The derivation further makes use
of certain linearizing approximations that result in a well-behaved
linear equation system. We show that the MM solutions implied by this
equation system are identical to those from the original BP update
equations in case of nearly uniform priors and potentials. For strong
priors
and potentials (e.g., [j^qq ^i**]), the resulting solutions are 
not identical anymore, yet serve as reasonable approximations in a wide
range of problem parameters (see Section 4). WLOG, we start with
potentials that are centered around 1 and then re-center the potentials
before using them:^4

Definition 3 (Row-recentered residual matrix). Let
$\psi \in \mathbb{R}^{\ell \times k}$ be centered around 1 and
$\hat{\psi}$ be the residual matrix around 1. Furthermore, let
$r(j) := \sum_i \hat{\psi}(j, i)$ be the sum of the residuals of row
$j$. Then the “row-recentered residual matrix” $\hat{\psi}'$ has
entries
$\hat{\psi}'(j, i) := \frac{1}{k}\left(\hat{\psi}(j, i) - \frac{r(j)}{k}\right)$.
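As an illustration of Definition 3, the sketch below computes a
row-recentered residual matrix with NumPy. It assumes the reading
$\hat{\psi}'(j, i) = \frac{1}{k}(\hat{\psi}(j, i) - r(j)/k)$ of the
definition, with $k$ the number of columns; the function name and the
example potential are hypothetical.

```python
import numpy as np


def row_recentered_residual(psi):
    """Row-recentered residual of an (l x k) potential.

    Assumes psi has already been scaled so that its average entry is 1.
    """
    psi = np.asarray(psi, dtype=float)
    k = psi.shape[1]
    psi_hat = psi - 1.0                       # residual matrix around 1
    r = psi_hat.sum(axis=1, keepdims=True)    # r(j): residual sum of row j
    return (psi_hat - r / k) / k              # entries psi_hat'(j, i)


# Hypothetical non-square potential (2 x 3) whose average entry is 1.
psi = np.array([[1.2, 0.8, 1.3],
                [0.9, 1.1, 0.7]])
psi_rc = row_recentered_residual(psi)
```

Under this reading, every row of the result sums to zero, so the
re-centering removes the row-wise offset that a non-stochastic
potential would otherwise introduce.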

Before we can state our main result, we need some additional notation.
WLOG, let $[n]$ be the set of all nodes. For each node $s \in [n]$, let
$k_s$ be the number of its possible classes. Let
$\hat{k}_s := \frac{1}{k_s} 1_{k_s}$, i.e., the $k_s$-dimensional
uniform stochastic column vector. Furthermore, let
$k_{\mathrm{tot}} := \sum_{s \in [n]} k_s$ be the total number of
classes across nodes. To write all our resulting equations as one large
equation system, we stack the individual explicit ($\hat{x}_s$) and
implicit ($\hat{y}_s$) residual belief vectors together with the
$\hat{k}_s$-vectors one underneath the other to form three
$k_{\mathrm{tot}}$-dimensional stacked column vectors. We also combine
all row-recentered residual matrices into one large but sparse
$[k_{\mathrm{tot}} \times k_{\mathrm{tot}}]$-square block matrix

^4 Without changing the joint probability distribution, every potential
in an MRF can be scaled so that the average entry is 1. For

1 - 1-1 

example, given tp = [ g g 7 ], we scale by | to get ip = M 4 7 , 

“ L 1 3 6 J 

which has the identical semantics but is now centered around 1 . 



(notice that all entries for non-existing edges remain empty): 
We can now state our main theorem:

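Theorem 4 (Linearizing Belief Propagation). Let $\hat{y}$, $\hat{x}$,
$\hat{k}$, and $\hat{\Psi}'$ be the above defined residual vectors and
matrix. Let $\epsilon$ be a bound on the standard deviation of all
non-zero entries of $\hat{\Psi}'$ and $\hat{x}$:
$\sigma(\hat{\Psi}') < \epsilon$ and $\sigma(\hat{x}) < \epsilon$. Let
$y_v^{\mathrm{BP}}$ be the final belief assignment for any node $v$
after convergence of BP. Then, for $\epsilon \to 0^{+}$,
$\arg\max_i y_v^{\mathrm{BP}}(i) = \arg\max_i \hat{y}_v^{\mathrm{Lin}}(i)$,
where $\hat{y}_v^{\mathrm{Lin}}$ results from solving the following
system of $k_{\mathrm{tot}}$ linear equations in $\hat{y}$:

$$\hat{y} = \hat{x} + \underbrace{\hat{\Psi}'^{\top} \hat{k}}_{\text{2nd}} + \underbrace{\hat{\Psi}'^{\top} \hat{y}}_{\text{3rd}} - \underbrace{\hat{\Psi}'^{\top 2}\, \hat{y}}_{\text{4th}} \qquad (3)$$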

In other words, the MM node labeling from BP can be approximated by
solving a linear equation system if each of the potentials and each of
the beliefs are reasonably tightly centered around their average
values. Notice that the 2nd term $\hat{\Psi}'^{\top} \hat{k}$ is a
“bias” vector that depends only on the structure of the network and the
potentials, but not the beliefs. We thus sometimes prefer to write
$\hat{c}' := \hat{\Psi}'^{\top} \hat{k}$ to emphasize that it remains
constant during the iterations. This term vanishes if all potentials
are doubly stochastic. Also notice that the 4th term is what was called
the “echo cancellation” in (Gatterbauer et al. 2015).^5 Simple
algebraic manipulations then lead to a closed-form solution by solving
Eq. (3) for $\hat{y}$:

$$\hat{y} = \left(I - \hat{\Psi}'^{\top} + \hat{\Psi}'^{\top 2}\right)^{-1} (\hat{x} + \hat{c}') \qquad (4)$$
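The algebraic step from Eq. (3) to Eq. (4) can be spelled out with
placeholder symbols. As a notational assumption made only for this
sketch, let $A$ denote the matrix of the 3rd term, $B$ the matrix of
the 4th term, and $b := \hat{x} + \hat{c}'$ the constant part; the last
line additionally assumes that $I - A + B$ is invertible.

```latex
\begin{align*}
\hat{y} &= b + A\hat{y} - B\hat{y} \\
(I - A + B)\,\hat{y} &= b \\
\hat{y} &= (I - A + B)^{-1}\, b
\end{align*}
```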

Iterative updates and convergence 

The complexity of inverting a matrix is cubic in the number of
variables, which makes direct application of Eq. (4) difficult.
Instead, we use Eq. (3), which gives an implicit definition of the
final beliefs, iteratively. Starting with an arbitrary initialization
of $\hat{y}$ (e.g., all values zero), we repeatedly compute the
right-hand side of the equations and update the values of $\hat{y}$
until the process converges:^6


^5 Notice that the BP update equations send a message across an edge
that excludes information received across the same edge from the other
direction: “$u \in N(s) \setminus t$” in Eq. (2). In a probabilistic
scenario on tree-based graphs, this echo cancellation is required for
correctness. In loopy graphs (without well-justified semantics), this
term still compensates for the message a node $t$ would otherwise send
to itself via a neighbor $s$, i.e., via the path $t \to s \to t$.

^6 Interestingly, our linearized update equations, Eq. (5), are
reminiscent of the update equations for the mean beliefs in Gaussian
MRFs (Malioutov, Johnson, and Willsky 2006; Su and Wu 2015; Weiss and
Freeman 2001). Notice, however, that whereas the update equations are
exact in the case of continuous Gaussian MRFs, our equations are
approximations for the general discrete case.


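Proposition 5 (Update equations). The positive fixed points for Eq. (3)
can be calculated iteratively with the following update equations
starting from $\hat{y}^{(0)} = 0$:

$$\hat{y}^{(r+1)} \leftarrow (\hat{x} + \hat{c}') + \left(\hat{\Psi}'^{\top} - \hat{\Psi}'^{\top 2}\right) \hat{y}^{(r)} \qquad (5)$$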

These particular update equations allow us to give a sufficient and
necessary criterion for convergence via the spectral radius $\rho$ of a
matrix.^7

Corollary 6 (Convergence). The update Eq. (5) converges if and only if
$\rho\left(\hat{\Psi}'^{\top} - \hat{\Psi}'^{\top 2}\right) < 1$.
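The relation between the iterative updates and the closed form can be
checked numerically. The following sketch is generic and makes two
assumptions for illustration: `M` is a random stand-in for the
difference of the 3rd-term and 4th-term matrices, and `b` is a random
stand-in for the constant part $\hat{x} + \hat{c}'$; neither is built
from an actual MRF.

```python
import numpy as np


def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return max(abs(np.linalg.eigvals(M)))


def iterate(M, b, num_iters=200):
    """Repeatedly apply y <- b + M y, starting from y = 0."""
    y = np.zeros_like(b)
    for _ in range(num_iters):
        y = b + M @ y
    return y


rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
M *= 0.5 / spectral_radius(M)       # rescale so the spectral radius is 0.5
b = rng.normal(size=6)

y_iter = iterate(M, b)
y_closed = np.linalg.solve(np.eye(6) - M, b)   # solves (I - M) y = b
```

With a spectral radius below 1 the iterates approach the solution of
the linear system; rescaling `M` to a spectral radius above 1 makes the
same loop diverge.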

Thus, the updates converge towards the closed-form solution, and the
final beliefs of each node can be computed via efficient matrix
operations with optimized packages, while the implicit form gives us
guarantees for the convergence of this process.^8 In order to apply our
approach to problem settings with spectral radius bigger than one (for
which direct application of Eq. (5) would not work), we propose to
modify the model by weakening the potentials. In other words, we
multiply $\hat{\Psi}'$ with a factor that guarantees convergence. We
call the multiplicative factor which exactly separates convergence from
divergence the “convergence boundary” $\epsilon_*$. Choosing any
$\epsilon$ with $s := \frac{\epsilon}{\epsilon_*}$ and $s < 1$
guarantees convergence. We call any choice of $s$ the “convergence
parameter.”

Definition 7 (Convergence boundary $\epsilon_*$). For any
$\hat{\Psi}'$, the convergence boundary $\epsilon_* > 0$ is defined
implicitly by
$\rho\left(\epsilon_* \hat{\Psi}'^{\top} - \epsilon_*^2\, \hat{\Psi}'^{\top 2}\right) = 1$.
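One way to locate such a boundary numerically is bracketing followed by
bisection on the spectral radius. The sketch below is an assumption-laden
illustration rather than the paper's procedure: `A` and `B` are
hypothetical stand-ins for the two matrices inside the spectral radius
(scaled linearly and quadratically, respectively), and the search
returns a crossing of 1 inside the bracket, which is the smallest one
only if the spectral radius grows monotonically with the scaling
factor.

```python
import numpy as np


def spectral_radius(M):
    """Largest absolute value among the eigenvalues of M."""
    return max(abs(np.linalg.eigvals(M)))


def convergence_boundary(A, B, tol=1e-10):
    """Find eps with spectral_radius(eps * A - eps**2 * B) = 1."""
    def rho(eps):
        return spectral_radius(eps * A - eps**2 * B)

    lo, hi = 0.0, 1.0
    while rho(hi) < 1.0:              # grow the bracket until rho >= 1
        lo, hi = hi, 2.0 * hi
        if hi > 1e12:
            raise ValueError("no convergence boundary found")
    while hi - lo > tol:              # bisection on rho(eps) = 1
        mid = 0.5 * (lo + hi)
        if rho(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical 2 x 2 stand-ins: eigenvalues are -2 eps^2 +/- eps,
# so the boundary solves 2 eps^2 + eps = 1.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = 2.0 * np.eye(2)
eps_star = convergence_boundary(A, B)
eps = 0.5 * eps_star                  # convergence parameter s = 0.5
```

Choosing a convergence parameter $s < 1$ and scaling by
$\epsilon = s \cdot \epsilon_*$ then yields a spectral radius below 1
for this example.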

Computational complexity 

Naively materializing $\hat{\Psi}'$ would lead to a space requirement
of $O(n^2 k_{\max}^2)$, where $n$ is the number of nodes and
$k_{\max}$ the max number of classes per node. However, by using a
sparse matrix implementation, both the space requirement and the
computational complexity of each iteration are only proportional to the
number of edges: $O(m k_{\max}^2)$. The time complexity is identical to
the one of message-passing with division, which avoids redundant
calculations and is faster than standard BP on graphs with high node
degrees (Koller and Friedman 2009). However, the ability to use
existing highly-optimized packages for efficient matrix-vector
multiplication will considerably speed up the actual calculations.

4 Experiments 

Questions. Our experiments will answer the following 3 questions: (1)
What is the effect of the convergence parameter $s$ on accuracy and
number of required iterations until convergence? (2) How accurate is
our approximation under varying conditions: (i) the density of the
network, (ii) the strength of the interaction, and (iii) the fraction
of labeled nodes? (3) How fast is the linearized approximation as
compared to standard Loopy BP?

^7 The “spectral radius” $\rho(\cdot)$ of a matrix is the supremum
among the absolute values of its eigenvalues.

^8 The intuition behind these equivalences can be illustrated by
comparing to the geometric series $S = 1 + x + x^2 + x^3 + \ldots$ and
its closed form $S = (1 - x)^{-1}$. Whereas for $|x| < 1$, the series
converges to its closed form, for $|x| > 1$, it diverges, and the
closed form is meaningless.

Experimental protocol. We define “accuracy” as the fraction of
unlabeled nodes that receive correct labels. In order to evaluate the
accuracy of a method, we need to use graphs with known label ground
truth (GT). As we are interested in the accuracy as a function of
various parameters, we need graphs with controlled GT. We thus decided
to compare BP against its linearization on synthetic graphs with known
GT, which allows us to measure the accuracy as a result of systematic
parameter changes. The well-studied stochastic block model (Airoldi et
al. 2008) leads to networks with degree distributions that are not
similar to those found in most empirical network data. Our synthetic
graph generator is thus a variant thereof with two important
differences: (1) we actively control the degree distributions in the
resulting graph; and (2) we “plant” exact graph properties (instead of
fixing a property only in expectation). In other words, our generator
preserves the desired degree distribution and compatibilities between
classes. The online appendix (Gatterbauer 2015) contains all details.
We focus on the scenario of a network with one non-symmetric potential
along each edge. The generator creates a graph using a tuple of
parameters $(n, m, \alpha, \psi, \mathrm{dist})$, where $n$ is the
number of nodes, $m$ is the number of edges, $\alpha$ is the node label
distribution with $\alpha(i)$ being the fraction of nodes of class $i$,
$\psi$ is the edge potential, and dist is a chosen degree distribution
(e.g., uniform or power law with chosen coefficient).

Parameter choices. Throughout our experiments, we use $k = 3$ classes
and a $3 \times 3$ potential $\psi$ with entries 1 and $h$,
parameterized by a value $h$ representing the ratio between min and max
entries. Dividing by $(2 + h)$ centers it around 1. Thus parameter $h$
models the strength of the potential, and we expect higher values of
$h$ to make our approximation less suitable. Notice that this matrix is
not symmetric and shows very different modulation behavior across both
directions of an edge. We create graphs with $n$ nodes and assign the
same fraction of nodes to each of the 3 classes:
$\alpha = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$. We also vary the
parameters $m$ and $d = \frac{m}{n}$ as the average in- and outdegree
in the graph, and we assume a power law distribution with coefficient
0.5. We then keep a fraction $f$ of node labels and measure accuracy on
the remainder.
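The evaluation step described above (keep a fraction $f$ of the labels,
score only the remaining nodes) can be sketched as follows. The helper
names `split_labels` and `accuracy`, the seed, and the toy label list
are hypothetical; the sketch does not reproduce the paper's graph
generator.

```python
import random


def split_labels(labels, f, seed=0):
    """Keep a fraction f of the node labels; the rest count as unlabeled."""
    rng = random.Random(seed)
    nodes = list(range(len(labels)))
    kept = set(rng.sample(nodes, round(f * len(nodes))))
    unlabeled = [v for v in nodes if v not in kept]
    return kept, unlabeled


def accuracy(predicted, truth, unlabeled):
    """Fraction of unlabeled nodes that receive the correct label.

    Assumes at least one unlabeled node.
    """
    correct = sum(predicted[v] == truth[v] for v in unlabeled)
    return correct / len(unlabeled)


# Hypothetical ground truth: 30 nodes, the same share in each of 3 classes.
truth = [0, 1, 2] * 10
kept, unlabeled = split_labels(truth, f=0.1)
perfect = accuracy(truth, truth, unlabeled)
```

Scoring only the unlabeled nodes keeps the kept labels from inflating
the reported accuracy.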

Computational setup. All methods are implemented in Python and use the
optimized SciPy library (Jones et al. 2001) to handle sparse matrix
operations. The experiments are run on a 2.5 GHz Intel Core i5 with
16 GB of main memory and a 1 TB SSD hard drive. To allow comparability
across implementations, we limit evaluation to one processor. For
timing BP, we use message-passing with division, which is faster than
standard BP on graphs with high node degree (Koller and Friedman 2009).
To calculate the approximate spectral radius of a matrix, we use a
method from the PyAMG library (Bell, Olson, and Schroder 2011) that
implements a technique described in (Bai et al. 2000). Our code,
including the data generator, is inspired by Scikit-learn (Pedregosa et
al. 2011) and is available on GitHub to encourage reproducible research
(SSLH 2015).

Question 1. What is the effect of scaling parameter s on 
accuracy and number of iterations for convergence? 

Result 1. Our scaling parameter $s$ gives an exact criterion for our
approach to converge. In contrast, BP often does not converge and
requires a lot of fine-tuning; e.g., damping or even scaling of the
potential. The accuracy of the linearization is highest for $s$ close
to or slightly above 1 and when not iterating until convergence.

Figure 2a shows the number of required iterations to reach convergence
and confirms our theoretical results from Corollary 6. In case the
convergence condition does not hold, we scale the centered potential by
a value $\epsilon$, resulting from $\epsilon = s \cdot \epsilon_*$ with
$s < 1$. This action weakens the potentials, but preserves the relative
affinities (we also use the same approach to help BP find a fixed point
if it does not converge within 200 iterations). Figure 2b shows what
happens to accuracy if we run the iterative updates a fixed number of
times as a function of $s$. Notice that even considerably scaling a
potential does not entirely change the model and still gives reasonable
approximations. The figure fixes a number of iterations, but then again
varies $\epsilon$ via $s$. Also interestingly, almost all of the
performance gains from the linearized update equations come from
running just a few iterations, and convergence is not necessary for
optimal labeling; instead, by choosing $s \approx 1$ (at the exact
boundary of convergence) or even $s > 1$ and iterating only a few
times, we can maximize the expected accuracy. For the remaining
accuracy experiments, we use $s = 0.5$ and run our algorithm to
convergence.

Question 2. How accurate is our approximation, and under which
conditions is it reasonable?

Result 2. The linearization gives labeling accuracy comparable to loopy
BP for graphs with weak potentials. The performance deteriorates the
most in dense networks with strong potentials.

We found that $h$, $d$ and $f$ have important influence on the labeling
accuracy of BP and its linearization (whereas $n$, dist and $\alpha$
influence only to a lesser extent). Figures 2c and 2d show accuracy as
a function of the fraction $f$ of labeled nodes. Notice that we chose
the best BP was able to perform (over several choices of $\epsilon$ and
damping factors to make it converge) whereas for LinBP we consistently
chose $s = 0.5$ as proposed in (Gatterbauer et al. 2015). Figures 2e to
2g show labeling quality as a function of the strength $h$ of the
potential. For strong potentials ($h > 3$), BP gives better accuracy if
it converges. In practice, BP often did not converge within 200
iterations even for weak potentials (bordered data points required
damping; red crosses required additional entry-wise scaling of the
potential with our convergence boundary $\epsilon_*$). In our
experiments, BP often did not converge despite using damping,
surprisingly often when $h$ is not big. It is known that if the
potentials are close to indifference then loopy BP usually converges.
In this case, our formalism is equivalent to loopy BP (this follows
from our linearization). Thus, whenever loopy BP did not converge, we
simply exponentiated the entries of the potential with a varying factor
$\epsilon$ until BP converged. Thus for high $h$, BP can perform better
than the linearization, but only after a lot of fine-tuning of
parameters. In contrast, for our formulation we know exactly the
boundary of convergence.

Figure 2: Experimental results for BP and its linearization
(abbreviated here as “Lin”): $f$ represents the fraction of labeled
nodes, $h$ the strength of potentials, and $d$ the average node in- and
outdegree. All graphs except for (h) have $n = 1000$ nodes. (a): The
convergence parameter $s$ exactly determines convergence for the
linearization. (b): Accuracy increases for $s$ close to 1 and few
iterations. (c, d): $f$ only marginally affects the relative accuracy
between BP and its linearization. (e, f, g): For strong potentials, BP
gives better accuracy if it converges. In practice, BP often did not
converge within 200 iterations even for weak potentials and required a
lot of fine-tuning (damping and/or entry-wise scaling of the potential
with our convergence boundary $\epsilon_*$ to $s = 1$). (h): Each
iteration of our approach is 50 times faster than an implementation of
BP with division. In addition, deploying a proper damping strategy
often requires 100s of iterations, which can bring up the total
speed-up to a factor 1000 for some of the above data points. (Each data
point results from at least 10 samples).

Overall, the linearization gives comparable results to the original BP
for weak potentials, and BP performs better than the linearization only
for strong potentials with $h > 3$ and damping (see a few yellow dots
without borders as exceptions), after fine-tuning BP using our own
convergence boundary to scale the potentials, or after a lot of manual
fine-tuning.

Question 3. How fast is the linearized approximation as 
compared to BP? 

Result 3. The linearization is around 100 times faster than 
BP per iteration and often needs 10 times fewer iterations 
until convergence. In practice, this can lead to a speed-up 
of 1000 times. 

A key advantage of the linearization is that it has predictable
convergence and comes with considerable speed-ups.
Figure 2h shows that our approach scales linearly in the
number of edges and is 50 times faster than regular loopy BP
per iteration; an iteration on a graph with 3 million nodes
and 30 million edges takes less than 2 sec. Calculating the
exact convergence boundary via a spectral radius calculation
can take more time (approx. 1000 sec for the same
graph). Notice that any dampening strategy for BP results
in an increased number of iterations and needs to overcome the
additional slow-down of further iterations. Also recall that
on each circled point in Figs. 2e to 2g, BP did not converge
within 200 iterations and required dampening; each
red cross required additional scaling of the potentials with
our calculated ε* in order to make BP converge.

5 Conclusions 

We have derived a linearization of BP for arbitrary pairwise
MRFs for the purpose of node labeling with MM-inference.
The approach transforms the parameters of an MRF into a
linear equation system that can be solved with simple iterative
updates. These updates come with exact convergence
guarantees, allow a closed-form solution, keep the derived
beliefs normalized at each step, and can thus be put into an
efficient linear algebra framework that does not require
normalization at each step. Experiments on carefully controlled
synthetic data with known ground truth show that our approach
performs comparably with loopy BP for weak potentials
and comes with predictable behavior, compelling
computational advantages, and an easy implementation with
only a few lines of code. An unexplored application of the
linearization may be speeding up the convergence of regular BP
by starting from good approximations of its fixed points.
A Nomenclature

$n$    number of nodes
$m$    number of edges
$s, t, u$    indices used for nodes
$N(s)$    set of neighbors of node s
$k, \ell$    number of classes, $k_s$ is number of classes for node s, $k(s)$ is the class of node s
$i, j, g$    indices used for classes
$(r)$    index used for iteration
$x_s$    $k_s$-dimensional prior (or explicit) belief vector of node s
$y_s$    $k_s$-dimensional posterior (or implicit or final) belief vector of node s
$m_{st}$    $k_t$-dimensional message vector from node s to node t (s → t)
$\psi$    $\ell \times k$ potential (or coupling matrix): $\psi(j, i)$ indicates the influence of class j of the sending node on class i of the receiving node. WLOG potentials are scaled to be centered around 1.
$\hat x, \hat y, \hat m, \hat\psi$    The hat notation “ˆ” indicates residuals after centering
$\hat\psi', \hat\psi''$    row-recentered “ ' ” or column-recentered “ '' ” residual potential
$\mathcal{P}'$    set of all row-recentered residual potentials
$Q$    set of node types
$q$    index used for node types
$k(q)$    number of classes for node type q ($q \in Q$)
$K$    set of numbers of classes across all nodes: $K = \{k(1), k(2), \ldots, k(|Q|)\}$
$N_k$    set of nodes with k classes
$n_k$    number of nodes with k classes
$o(s)$    order (sequence) of node s within $N_{k_s}$
$\hat X_k, \hat Y_k$    $n_k \times k$ prior or posterior belief matrix: $\hat X(o(s), j)$ indicates the belief in class j by node s
$W_{\hat\psi'}$    $n_\ell \times n_k$ adjacency matrix for edges with potential $\hat\psi' \in \mathbb{R}^{\ell \times k}$: $W(o(s), o(t)) \neq 0$ indicates an edge s → t that carries the potential
$I_k$    $k \times k$ identity matrix
$X^T$    transpose of matrix X
$\mathrm{vec}(X)$    vectorization of matrix X
$X \otimes Y$    Kronecker product between matrices X and Y
$X \odot Y$    Hadamard product or component-wise multiplication: $Z = X \odot Y \Leftrightarrow Z(i, j) = X(i, j) \cdot Y(i, j)$
$X \oslash Y$    Component-wise division: $Z = X \oslash Y \Leftrightarrow Z(i, j) = X(i, j) / Y(i, j)$ with 0/0 = 0
$Z$    normalizer
$1_k$    k-dimensional column vector with all entries equal to 1
$[x]_{\ell \times k}$    $\ell \times k$ matrix with all entries equal to x
$\alpha$    k-dimensional node label distribution
$[k]$    $[k] := \{1, 2, \ldots, k\}$

Figure 3: Nomenclature


Given a matrix X, we write X(i, j) for one scalar entry, X(i, :) for the i-th row vector, and X(:, j) for the j-th column vector.
We also write $\sum_j$ as short form for $\sum_{j \in [k]}$ whenever k is clear from the context.

B Derivation of the linearization of BP over any pairwise MRFs 

This section contains the derivation of Theorem 4. We will center the elements of all message and belief vectors around their
“natural default values,” i.e., the elements of $m_{st}$ around 1, and the elements of $x_s$ and $y_s$ around $1/k_s$ (Lemma 10 will provide
some intuition why our chosen center points are the natural choice to simplify all later derivations). We are interested in the
residual values defined by $\hat m(i) := m(i) - 1$, $\hat x_s(j) := x_s(j) - 1/k_s$, and $\hat y_s(j) := y_s(j) - 1/k_s$.

WLOG, we start from a potential $\psi \in \mathbb{R}^{\ell \times k}$ that is centered around 1 (recall that we can scale any potential with a positive
real number without changing the semantics of the MRF). We then recenter a potential differently across both
directions of an edge so as to make it singly stochastic for either direction. As a result, most of the residual terms in the belief update
equations cancel each other out, leading to simplified equations. Definition 3 provided the definition for the residual matrix in
one direction, row-recentering, with entries $\hat\psi'(j, i) = \frac{1}{k}\big(\hat\psi(j, i) - \hat r(j)/k\big)$, where $\hat r(j)$ is the residual sum of row j. Adding to that definition, the row-recentered stochastic matrix $\psi'$ is centered around $1/k$ and has
entries $\psi'(j, i) := 1/k + \hat\psi'(j, i)$. Both matrices are indicated with a single apostrophe '.

Analogously, let $\hat c(i) = \sum_j \hat\psi(j, i)$ be the residual sum of column i. Then, a column-recentered residual matrix $\hat\psi''$ has
entries $\hat\psi''(j, i) := \frac{1}{\ell}\big(\hat\psi(j, i) - \hat c(i)/\ell\big)$, and the column-recentered stochastic matrix $\psi''$ has entries $\psi''(j, i) := 1/\ell + \hat\psi''(j, i)$.

Notice that both matrices are indicated with a double apostrophe ''. The resulting recentered residual potentials are coupling
matrices that make explicit the relative attraction and repulsion of neighboring nodes. For example, the sign of $\hat\psi'(j, i)$ tells




Original matrix and its recentered stochastic matrices:

ψ         i=1      i=2      r(j)
j=1       0.96     0.98     1.94
j=2       0.99     1.01     2
j=3       1.02     1.04     2.06
c(i)      2.97     3.03     s = 6

ψ' (row recentering)      i=1      i=2      r(j)'
j=1                       0.495    0.505    1
j=2                       0.495    0.505    1
j=3                       0.495    0.505    1
c(i)'                     1.485    1.515    3

ψ'' (column recentering)  i=1      i=2      r(j)''
j=1                       0.323    0.323    0.646
j=2                       0.333    0.333    0.666
j=3                       0.343    0.343    0.686
c(i)''                    1        1        2

Residual matrix and its recentered residual matrices:

ψ̂         i=1      i=2      r̂(j)
j=1       -0.04    -0.02    -0.06
j=2       -0.01    0.01     0
j=3       0.02     0.04     0.06
ĉ(i)      -0.03    0.03     ŝ = 0

ψ̂' (row recentering)      i=1      i=2      r̂(j)'
j=1                       -0.005   0.005    0
j=2                       -0.005   0.005    0
j=3                       -0.005   0.005    0
ĉ(i)'                     -0.015   0.015    0

ψ̂'' (column recentering)  i=1      i=2      r̂(j)''
j=1                       -0.01    -0.01    -0.02
j=2                       0        0        0
j=3                       0.01     0.01     0.02
ĉ(i)''                    0        0        0

Figure 4: Example 9: Matrix $\psi \in \mathbb{R}^{3 \times 2}$ centered around 1, residual matrix $\hat\psi$, and row-recentered and column-recentered residual matrices
$\hat\psi'$ and $\hat\psi''$ (and stochastic matrices $\psi'$ and $\psi''$).



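Expression    Maclaurin series    Approximation

Logarithm    $\ln(1 + \epsilon) = \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \ldots$    $\approx \epsilon$
Product      $(1 + \epsilon_1)(1 + \epsilon_2) = 1 + \epsilon_1 + \epsilon_2 + \epsilon_1 \epsilon_2$    $\approx 1 + \epsilon_1 + \epsilon_2$
Division     $\frac{1 + \epsilon_1}{1 + \epsilon_2} = (1 + \epsilon_1)(1 - \epsilon_2 + \epsilon_2^2 - \ldots)$    $\approx 1 + \epsilon_1 - \epsilon_2$
Scaling      $(1 - \epsilon^2)^{-1}$    $\approx 1$


Figure 5: Table of our linearizing approximations.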


us if the class j attracts or repels class i in a neighbor, and the magnitude of $\hat\psi'(j, i)$ indicates the extent. Subsequently, this
centering allows us to rewrite belief propagation in terms of the residuals.

Notice that column-recentering and row-recentering are connected via the transpose. However, message modulation across
one direction of an edge is not simply the transpose of the modulation across the other direction:

Corollary 8 (Row-recentering vs. column-recentering). $(\hat\psi'')^T = (\hat\psi^T)'$. In particular,
$\hat\psi''^T_{st} = \hat\psi'_{ts}$.

We also write the $\ell$-dimensional vector $r := \psi 1_k$ for the row sums, the k-dimensional vector $c := \psi^T 1_\ell$ for the column
sums, and $s := \sum_j r(j) = 1_\ell^T r = 1_\ell^T \psi 1_k$ for the sum of all entries in a matrix; hats ($\hat r$, $\hat c$, $\hat s$) denote the same sums for the residual matrix $\hat\psi$. We illustrate recentering next with a detailed
example.

Example 9 (Recentering). Figure 4 shows the 3 x 2 matrix $\psi$ that is centered around 1 (i.e., each entry is close to 1 and the
average is exactly 1) together with the row sums r(j) and the column sums c(i). $\hat\psi$ is then the residual matrix. Notice that the
recentered residual matrices $\hat\psi'$ and $\hat\psi''$ have zero row sums $\hat r(j)'$ or column sums $\hat c(i)''$, respectively. As a consequence, the
row-recentered matrix $\psi'$ and column-recentered matrix $\psi''$ are row-stochastic or column-stochastic, respectively.

We will further make use of the linearizing Maclaurin series approximations shown in Fig. 5 to derive a well-behaved linear
equation system.

Recentering

The following lemma provides the mathematical justification for our particular choice of recentering:




































Lemma 10 (Recentering). Consider the update equation

$$y \leftarrow \frac{1}{Z}\, \psi^T x \qquad (6)$$

with x being an $\ell$-dimensional stochastic vector, $\psi \in \mathbb{R}^{\ell \times k}$ being centered around 1, and Z a normalizer that makes the
elements of the resulting k-dimensional vector y sum up to k. Then, the update equation can be approximated with the
row-recentered stochastic matrix $\psi'$ by

$$y \leftarrow k\, \psi'^T x \qquad (7)$$

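Proof Lemma 10. Our proof will express both equations (Eq. (6) with $\psi$ and Eq. (7) with $\psi'$) in terms of the residual matrix
$\hat\psi'$, and show that they lead to the same equation. From Definition 3 and the definitions at the beginning of Appendix B,
we know that $\psi(j, i) = 1 + \hat\psi(j, i)$ and $\hat\psi(j, i) = k \hat\psi'(j, i) + \hat r(j)/k$. Therefore, $\psi(j, i) = 1 + k \hat\psi'(j, i) + \hat r(j)/k$. Similarly,
$\psi'(j, i) = 1/k + \hat\psi'(j, i)$.

In the following, we are going to use matrix notation that allows us to express the above identities very compactly as:
$\psi^T = 1_k 1_\ell^T + \frac{1}{k} 1_k \hat r^T + k \hat\psi'^T$, $\psi'^T = \frac{1}{k} 1_k 1_\ell^T + \hat\psi'^T$, and $x = \frac{1}{\ell} 1_\ell + \hat x$. (Quick illustration: $\frac{1}{k} 1_k \hat r^T = \frac{1}{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} [-0.06, 0, 0.06] = \frac{1}{2} \begin{bmatrix} -0.06 & 0 & 0.06 \\ -0.06 & 0 & 0.06 \end{bmatrix}$ for $\psi$ in Fig. 4.)

(i) Equation (6): We calculate y in two steps that treat the normalization separately: first $z = \psi^T x$, and then $y = z / Z$.

$$z = \psi^T x = \Big(1_k 1_\ell^T + \tfrac{1}{k} 1_k \hat r^T + k \hat\psi'^T\Big)\Big(\tfrac{1}{\ell} 1_\ell + \hat x\Big)$$
$$= 1_k + \tfrac{1}{k\ell} 1_k \hat r^T 1_\ell + \tfrac{k}{\ell} \hat\psi'^T 1_\ell + 1_k 1_\ell^T \hat x + \tfrac{1}{k} 1_k \hat r^T \hat x + k \hat\psi'^T \hat x$$
$$= 1_k + \tfrac{1}{k} 1_k \hat r^T \hat x + \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x$$

The last step uses $\hat r^T 1_\ell = \hat s = 0$, $1_\ell^T \hat x = 0$, and $k \hat\psi'^T 1_\ell = \hat c$.

We next calculate the value of the normalizer. Recall that the normalizer makes the entries of the vector z sum up to k.

$$Z = \frac{1}{k} 1_k^T z = \frac{1}{k}\Big(k + \hat r^T \hat x + \tfrac{1}{\ell} 1_k^T \hat c + k\, 1_k^T \hat\psi'^T \hat x\Big) = 1 + \frac{\hat r^T \hat x}{k}$$

Here $1_k^T \hat c = \hat s = 0$, and $1_k^T \hat\psi'^T = 0$ because the rows of $\hat\psi'$ sum to 0.

We see that the normalizer is not a constant but also depends on $\hat r$ and $\hat x$. However, notice that if each row of $\psi$ is centered
around 1 (not just the matrix as a whole), then $\hat r(j) = 0$ for all rows, and thus Z = 1. In the following, we approximate
$1/(1 + \epsilon) \approx (1 - \epsilon)$ and $(1 + \epsilon_1)(1 - \epsilon_2) \approx 1 + \epsilon_1 - \epsilon_2$.

$$y = \frac{z}{Z} = \Big(1_k + \tfrac{1}{k} 1_k \hat r^T \hat x + \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x\Big) \Big/ \Big(1 + \tfrac{\hat r^T \hat x}{k}\Big)$$
$$\approx \Big(1_k + \tfrac{1}{k} 1_k \hat r^T \hat x + \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x\Big)\Big(1 - \tfrac{\hat r^T \hat x}{k}\Big)$$
$$\approx 1_k + \tfrac{1}{k} 1_k \hat r^T \hat x + \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x - \tfrac{1}{k} 1_k \hat r^T \hat x$$
$$= 1_k + \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x$$

Notice that the above equation is exact if $\hat r(j) = 0$ for all rows.

(ii) Equation (7): Here we get the same result much faster:

$$y = k \psi'^T x = \Big(1_k 1_\ell^T + k \hat\psi'^T\Big)\Big(\tfrac{1}{\ell} 1_\ell + \hat x\Big)$$
$$= 1_k + \tfrac{k}{\ell} \hat\psi'^T 1_\ell + 1_k 1_\ell^T \hat x + k \hat\psi'^T \hat x$$
$$= 1_k + \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x$$

It follows that Eq. (7) is an approximation of Eq. (6), in general, and both equations are equivalent if each row in $\psi$ is centered
around 1. □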


Notice that, since $y(j) = 1 + \hat y(j)$, we can express the update equation in terms of residuals as

$$\hat y \leftarrow \tfrac{1}{\ell} \hat c + k \hat\psi'^T \hat x$$

Further notice that if each column in the original potential is centered around 1, then the term $\hat c$ disappears.


Overall, Lemma 10 implies that by recentering the coupling matrix, we can replace the normalizer with a constant, which
considerably simplifies our later derivations. The proof also showed that the approximation becomes exact if each row in $\psi$ is
centered around 1.

Example 11 (Recentering (Example 9 continued)). Consider again matrix $\psi \in \mathbb{R}^{3 \times 2}$ in Fig. 4: The matrix is centered
around 1 as the sum of its entries is s = 6. However, row 1 is not centered around 1 as its row sum is r(1) = 1.94 instead of
2. Next assume $x = [0.1, 0.1, 0.8]^T$. Then $y = [0.99021, 1.00979]^T$ for Eq. (6), but $y = [0.99, 1.01]^T$ with Eq. (7). Thus, the
residuals are ±0.00979 and ±0.01, respectively, and the relative difference is ≈ 2%.
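Examples 9 and 11 can be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: the helper name `recenter` and all variable names are our own illustrative choices, and the recentering formulas are the ones implied by the identities $\hat\psi(j,i) = k\hat\psi'(j,i) + \hat r(j)/k$ and its column analogue.

```python
import numpy as np

def recenter(psi):
    """Return the row- and column-recentered residuals of a potential centered around 1."""
    l, k = psi.shape
    res = psi - 1.0                               # residual matrix psi_hat
    r_hat = res.sum(axis=1, keepdims=True)        # residual row sums
    c_hat = res.sum(axis=0, keepdims=True)        # residual column sums
    row_res = (res - r_hat / k) / k               # psi_hat'  (rows sum to 0)
    col_res = (res - c_hat / l) / l               # psi_hat'' (columns sum to 0)
    return row_res, col_res

# Potential from Fig. 4 (3 x 2, centered around 1)
psi = np.array([[0.96, 0.98],
                [0.99, 1.01],
                [1.02, 1.04]])
l, k = psi.shape
row_res, col_res = recenter(psi)
psi_row = 1.0 / k + row_res                       # row-stochastic psi'
psi_col = 1.0 / l + col_res                       # column-stochastic psi''

# Example 11: exact normalized update (Eq. 6) vs. recentered update (Eq. 7)
x = np.array([0.1, 0.1, 0.8])
z = psi.T @ x
y_eq6 = k * z / z.sum()                           # normalizer makes y sum to k
y_eq7 = k * psi_row.T @ x

print(y_eq6, y_eq7)
```

Both vectors sum to k = 2; they differ only because row 1 and row 3 of $\psi$ are not individually centered around 1.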


Centered BP 

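By using the previous lemma and focusing on the residuals only, we can next transform the belief update equations from
multiplication into addition:

Lemma 12 (Centered BP). By appropriately centering the coupling matrix, beliefs and messages, the equations for belief
propagation can be approximated by:

$$\hat y_s(j) \leftarrow \hat x_s(j) + \frac{1}{k_s} \sum_{u \in N(s)} \hat m_{us}(j) \qquad (8)$$

$$\hat m_{st}(i) \leftarrow \frac{k_t}{k_s} \hat c'_{st}(i) + k_t \sum_j \hat\psi'_{st}(j, i) \Big(\hat y_s(j) - \frac{1}{k_s} \hat m_{ts}(j)\Big) \qquad (9)$$

where $\hat c'_{st}(i) = \sum_j \hat\psi'_{st}(j, i)$ is the i-th column sum of the row-recentered residual potential.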

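Proof Lemma 12. (i) Equation (8): Substituting the expansions into the belief updates Eq. (1) and taking the logarithm leads to

$$\ln(1 + k_s \hat y_s(j)) \leftarrow -\ln Z_s + \ln(1 + k_s \hat x_s(j)) + \sum_{u \in N(s)} \ln(1 + \hat m_{us}(j))$$
$$k_s \hat y_s(j) \leftarrow -\ln Z_s + k_s \hat x_s(j) + \sum_{u \in N(s)} \hat m_{us}(j) \qquad (10)$$

where the second line uses $\ln(1 + \epsilon) \approx \epsilon$. Summing both sides over j gives us:

$$k_s \sum_j \hat y_s(j) \leftarrow -k_s \ln Z_s + k_s \sum_j \hat x_s(j) + \sum_j \sum_{u \in N(s)} \hat m_{us}(j)$$

All sums over residuals are 0. Hence, we see that $\ln Z_s$ needs to be 0, and therefore our normalizer is actually a normalization constant and for all nodes
$Z_s = 1$. Plugging $Z_s = 1$ back into Eq. (10) leads to Eq. (8):

$$\hat y_s(j) \leftarrow \hat x_s(j) + \frac{1}{k_s} \sum_{u \in N(s)} \hat m_{us}(j)$$

(ii) Equation (9): Using Lemma 10, we can write Eq. (2) as follows (recall that $k_t$ and $\psi'_{st}$ take care of the normalization):

$$m_{st}(i) \leftarrow k_t \sum_j \psi'_{st}(j, i)\, x_s(j) \prod_{u \in N(s) \setminus t} m_{us}(j)$$

Figure 6: Matrix $\psi_{st}$ represents the edge potential from Eq. (15) and implies the direction s → t. By ignoring the “echo cancellation” term
in Eq. (16), one can think of the messages as $m_{st} \propto \psi_{st}^T y_s$ and $m_{ts} \propto \psi_{st} y_t$.

By further using Eq. (8), the product of the prior with all incoming messages except the one from t can be written to first order in terms of the belief of node s:

$$x_s(j) \prod_{u \in N(s) \setminus t} m_{us}(j) \approx \frac{1}{k_s} + \hat y_s(j) - \frac{1}{k_s} \hat m_{ts}(j)$$

Then, using the centering $m_{st}(i) = 1 + \hat m_{st}(i)$ and $\psi'_{st}(j, i) = 1/k_t + \hat\psi'_{st}(j, i)$ together with $\sum_j \hat y_s(j) = \sum_j \hat m_{ts}(j) = 0$, we get Eq. (9):

$$\hat m_{st}(i) \leftarrow \frac{k_t}{k_s} \hat c'_{st}(i) + k_t \sum_j \hat\psi'_{st}(j, i) \Big(\hat y_s(j) - \frac{1}{k_s} \hat m_{ts}(j)\Big)$$

Equations (8) and (9) can be written in matrix notation as:

$$\hat y_s \leftarrow \hat x_s + \frac{1}{k_s} \sum_{u \in N(s)} \hat m_{us} \qquad (11)$$

$$\hat m_{st} \leftarrow \frac{k_t}{k_s} \hat c'_{st} + k_t \hat\psi'^T_{st} \Big(\hat y_s - \frac{1}{k_s} \hat m_{ts}\Big) \qquad (12)$$

An alternative way to write the message updates is

$$\hat m_{st} \leftarrow \frac{k_t}{k_s} \hat c'_{st} + k_t \hat\psi'^T_{st} \Big(\hat x_s + \frac{1}{k_s} \sum_{u \in N(s) \setminus t} \hat m_{us}\Big) \qquad (13)$$

□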
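To get a feeling for the quality of the centered belief update, the exact BP belief update of Eq. (1) can be compared with the additive update of Eq. (8) for small residuals. The sketch below is only an illustration under our own assumptions: the residual values are made up, and all variable names are hypothetical.

```python
import numpy as np

k_s = 3
# Made-up residuals; each vector sums to 0 as required by the centering
x_hat = np.array([0.010, -0.005, -0.005])          # prior residuals of node s
m_hat = np.array([[0.020, -0.010, -0.010],         # residuals of incoming messages,
                  [-0.015, 0.005, 0.010]])         # one row per neighbor u

# Exact belief update (Eq. 1): multiply prior and messages, then normalize
x = 1.0 / k_s + x_hat
m = 1.0 + m_hat
y = x * m.prod(axis=0)
y /= y.sum()
y_hat_exact = y - 1.0 / k_s

# Centered (linearized) belief update (Eq. 8): purely additive, no normalizer
y_hat_lin = x_hat + m_hat.sum(axis=0) / k_s

print(y_hat_exact, y_hat_lin)
```

For residuals of this magnitude the two results differ only in second-order terms, and the linearized residuals still sum to 0 without any explicit normalization.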


It is instructive to compare the above derived linearized update equations against the matrix formulations of the original BP
update equations Eq. (1) and Eq. (2): by using the symbol $\odot$ for the Hadamard product (defined by $Z = X \odot Y \Leftrightarrow Z(i, j) = X(i, j) \cdot Y(i, j)$), those can be written compactly in
matrix notation as:

$$y_s \leftarrow \frac{1}{Z_s}\, x_s \odot \bigodot_{u \in N(s)} m_{us} \qquad (14)$$

$$m_{st} \leftarrow \frac{1}{Z_{st}}\, \psi_{st}^T \Big(x_s \odot \bigodot_{u \in N(s) \setminus t} m_{us}\Big) \qquad (15)$$

Notice that the potential $\psi_{st}$ is represented by a $k_s \times k_t$-dimensional “compatibility matrix” and that the transpose $\psi_{st}^T =
\psi_{ts}$ (see Fig. 6). This follows from the definition of a potential in a pairwise MRF and the resulting derivation of belief
propagation (Yedidia, Freeman, and Weiss 2003). Also notice that we could reduce the amount of necessary calculation by first
multiplying all incoming messages at a node, and then dividing through the message that a node sends to itself via a neighbor
(we call this compensation “echo cancellation”). This approach is also called “message-passing with division” (Koller and
Friedman 2009) (or “belief-update message passing”) and can be made precise by defining a component-wise division operator
by: $Z = X \oslash Y \Leftrightarrow Z(i, j) = X(i, j) / Y(i, j)$ where 0/0 = 0. Equation (15) can then be written more concisely as:

$$m_{st} \leftarrow \frac{1}{Z_{st}}\, \psi_{st}^T \big(y_s \oslash m_{ts}\big) \qquad (16)$$

We invite the reader to carefully compare Eqs. (11) and (12) with the original BP update equations Eqs. (14) and (16). Notice
that the first term in Eq. (12) vanishes in the case of doubly stochastic potentials (or more generally, potentials with equal
column and row sums). For non-doubly stochastic potentials, this term captures the prior probabilities of node classes resulting
from non-equivalent row or column sums.
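The equivalence between the standard message update and message-passing with division can be illustrated on a single node. The following NumPy sketch is ours and makes several assumptions: node s has three neighbors (t, a, b), all numeric values are invented, messages are normalized to sum to the number of classes of the receiving node, and the helper `normalize` is hypothetical.

```python
import numpy as np

def normalize(v, total):
    """Rescale a positive vector so that its entries sum to `total`."""
    return v * total / v.sum()

psi_st = np.array([[0.96, 0.98],
                   [0.99, 1.01],
                   [1.02, 1.04]])                  # k_s x k_t compatibility matrix
k_s, k_t = psi_st.shape

x_s = np.array([0.2, 0.3, 0.5])                    # prior belief of node s
m_ts = np.array([1.10, 0.90, 1.00])                # message from the target t back to s
m_as = np.array([0.80, 1.00, 1.20])                # messages from two further neighbors
m_bs = np.array([1.05, 1.00, 0.95])

# Standard update: exclude the message coming from t before modulating
m_st_standard = normalize(psi_st.T @ (x_s * m_as * m_bs), k_t)

# Update with division: compute the full belief first, then divide out m_ts
y_s = normalize(x_s * m_ts * m_as * m_bs, 1.0)
m_st_division = normalize(psi_st.T @ (y_s / m_ts), k_t)

print(m_st_standard, m_st_division)
```

The division variant reuses the belief $y_s$ for all outgoing messages of node s, which is the source of the computational saving mentioned in the text.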


Steady state messages 


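From Lemma 12, we can derive a closed-form equation for the message in steady-state of belief propagation.

Lemma 13 (Steady state messages). After convergence of belief propagation, message propagation can be approximated in
terms of the steady centered beliefs as:

$$\hat m_{st} = \frac{k_t}{k_s} \hat c'_{st} + k_t \hat\psi'^T_{st} \big(\hat y_s - \hat\psi''_{st} \hat y_t\big) \qquad (17)$$

Proof Lemma 13. To increase the readability of this proof, we ignore the subscripts in $\hat\psi_{st}$, $\hat c_{st}$, $\hat r_{st}$, and write instead $\hat\psi$, $\hat c$, $\hat r$,
respectively. We start by writing Eq. (9) for the messages in each of the two directions across the same edge s - t (for the direction t → s we use $\hat\psi'_{ts} = \hat\psi''^T$ from Corollary 8, with row sums $\hat r(j)''$ of $\hat\psi''$):

$$\hat m_{st}(i) \leftarrow \frac{k_t}{k_s} \hat c(i)' + k_t \sum_{j=1}^{k_s} \hat\psi'(j, i) \Big(\hat y_s(j) - \frac{1}{k_s} \hat m_{ts}(j)\Big)$$
$$\hat m_{ts}(j) \leftarrow \frac{k_s}{k_t} \hat r(j)'' + k_s \sum_{g=1}^{k_t} \hat\psi''(j, g) \Big(\hat y_t(g) - \frac{1}{k_t} \hat m_{st}(g)\Big)$$

We then simply combine both equations into one:

$$\hat m_{st}(i) \leftarrow \frac{k_t}{k_s} \hat c(i)' + k_t \sum_{j=1}^{k_s} \hat\psi'(j, i)\, \hat y_s(j) - \sum_{j=1}^{k_s} \hat\psi'(j, i) \Big(\hat r(j)'' + k_t \sum_{g=1}^{k_t} \hat\psi''(j, g)\, \hat y_t(g) - \sum_{g=1}^{k_t} \hat\psi''(j, g)\, \hat m_{st}(g)\Big)$$

Now, if the equations converge, then $\hat m_{st}(g)$ on both the left and right side of the equation need to be equivalent. We can,
therefore, replace the update symbol with equality and group related terms together:

$$\hat m_{st}(i) - \sum_j \hat\psi'(j, i) \sum_g \hat\psi''(j, g)\, \hat m_{st}(g) = \frac{k_t}{k_s} \hat c(i)' - \sum_j \hat\psi'(j, i)\, \hat r(j)'' + k_t \sum_j \hat\psi'(j, i)\, \hat y_s(j) - k_t \sum_j \hat\psi'(j, i) \sum_g \hat\psi''(j, g)\, \hat y_t(g)$$

With $I_{k_t}$ the $k_t$-dimensional identity matrix, this can then be written in matrix notation as:

$$(I_{k_t} - \hat\psi'^T \hat\psi'')\, \hat m_{st} = \frac{k_t}{k_s} \hat c' - \hat\psi'^T \hat r'' + k_t \hat\psi'^T \hat y_s - k_t \hat\psi'^T \hat\psi'' \hat y_t$$

Recall that $\hat c' = \hat\psi'^T 1_{k_s}$ and $\hat r'' = \hat\psi'' 1_{k_t}$:

$$(I_{k_t} - \hat\psi'^T \hat\psi'')\, \hat m_{st} = \frac{k_t}{k_s} \hat\psi'^T 1_{k_s} - \hat\psi'^T \hat\psi'' 1_{k_t} + k_t \hat\psi'^T \hat y_s - k_t \hat\psi'^T \hat\psi'' \hat y_t$$
$$= \hat\psi'^T \Big(\frac{k_t}{k_s} 1_{k_s} + k_t \hat y_s\Big) - \hat\psi'^T \hat\psi'' \big(1_{k_t} + k_t \hat y_t\big)$$

If all entries of $\hat\psi$ are appropriately small (so that the spectral radius $\rho(\hat\psi'^T \hat\psi'') < 1$), then the inverse of $(I_{k_t} - \hat\psi'^T \hat\psi'')$ exists.
Thus, by further substituting $\hat\psi'^T_* := (I_{k_t} - \hat\psi'^T \hat\psi'')^{-1} \hat\psi'^T$, we can write:

$$\hat m_{st} = \hat\psi'^T_* \Big(\frac{k_t}{k_s} 1_{k_s} + k_t \hat y_s\Big) - \hat\psi'^T_* \hat\psi'' \big(1_{k_t} + k_t \hat y_t\big)$$
$$= \hat\psi'^T_* \Big(\frac{k_t}{k_s} 1_{k_s} - \hat\psi'' 1_{k_t}\Big) + k_t \hat\psi'^T_* \hat y_s - k_t \hat\psi'^T_* \hat\psi'' \hat y_t \qquad (18)$$

By further substituting $\hat h := \frac{k_t}{k_s} \hat\psi'^T_* 1_{k_s} - \hat\psi'^T_* \hat\psi'' 1_{k_t}$, we get the following equation for the message updates Eq. (12) after
convergence:

$$\hat m_{st} = \hat h + k_t \hat\psi'^T_* \big(\hat y_s - \hat\psi'' \hat y_t\big) \qquad (19)$$

Next notice that $\hat\psi'_* \approx \hat\psi'$ since the entries of $\hat\psi'^T \hat\psi''$ are of second order in the residuals, and therefore $(I_{k_t} - \hat\psi'^T \hat\psi'')^{-1} \approx I_{k_t}$. From this, we can now also
approximate $\hat h \approx \frac{k_t}{k_s} (\hat\psi'^T 1_{k_s}) - \hat\psi'^T (\hat\psi'' 1_{k_t}) = \frac{k_t}{k_s} \hat c' - \hat\psi'^T \hat r''$. Further ignoring the second term, we get $\hat h \approx \frac{k_t}{k_s} \hat c'$.
Plugging back into Eq. (19) finally gives us Eq. (17). □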


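Also notice that we can alternatively write Eq. (18) as function of the uncentered beliefs, which results in a very intuitive
equation:

$$\hat m_{st} = k_t \hat\psi'^T_* \Big(\frac{1}{k_s} 1_{k_s} + \hat y_s\Big) - k_t \hat\psi'^T_* \hat\psi'' \Big(\frac{1}{k_t} 1_{k_t} + \hat y_t\Big)$$
$$= k_t \hat\psi'^T_* (y_s) - k_t \hat\psi'^T_* \hat\psi'' (y_t)$$
$$= k_t \hat\psi'^T_* \big(y_s - \hat\psi'' y_t\big)$$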
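The steady-state derivation can be checked numerically on a single edge. In the sketch below (our own illustration, with hypothetical names and a made-up 3 x 2 potential), the two centered message updates across one edge are iterated for fixed beliefs, and the result is compared with the closed form obtained by inverting $(I_{k_t} - \hat\psi'^T \hat\psi'')$ and with the first-order approximation of Lemma 13.

```python
import numpy as np

def recenter(psi):
    """Row- and column-recentered residuals of a potential centered around 1."""
    l, k = psi.shape
    res = psi - 1.0
    row = (res - res.sum(axis=1, keepdims=True) / k) / k
    col = (res - res.sum(axis=0, keepdims=True) / l) / l
    return row, col

psi = np.array([[1.2, 0.8],
                [0.9, 1.1],
                [1.0, 1.0]])                       # made-up potential, entries average to 1
k_s, k_t = psi.shape
row, col = recenter(psi)                           # psi_hat', psi_hat''
c_row = row.sum(axis=0)                            # column sums of psi_hat'
r_col = col.sum(axis=1)                            # row sums of psi_hat''

y_s = np.array([0.05, -0.02, -0.03])               # fixed centered beliefs (sum to 0)
y_t = np.array([0.04, -0.04])

# Iterate the centered message updates in both directions of the edge
m_st, m_ts = np.zeros(k_t), np.zeros(k_s)
for _ in range(100):
    m_st, m_ts = ((k_t / k_s) * c_row + k_t * row.T @ (y_s - m_ts / k_s),
                  (k_s / k_t) * r_col + k_s * col @ (y_t - m_st / k_t))

# Closed form with the exact inverse
rhs = row.T @ ((k_t / k_s) * np.ones(k_s) + k_t * y_s) \
      - row.T @ col @ (np.ones(k_t) + k_t * y_t)
m_closed = np.linalg.solve(np.eye(k_t) - row.T @ col, rhs)

# First-order approximation (inverse replaced by the identity)
m_approx = (k_t / k_s) * c_row + k_t * row.T @ (y_s - col @ y_t)

print(m_st, m_closed, m_approx)
```

The iterated messages agree with the closed form up to numerical precision, while the first-order approximation deviates only by terms of second order in the residuals.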


Theorem 4: The actual linearization 

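Finally, by using matrix notation, we can transform and write Eq. (17) for all nodes and edges together as one large equation
system and get Theorem 4.

Proof Theorem 4. For steady-state, we can write Eq. (8) in vector form as:

$$\hat y_s = \hat x_s + \frac{1}{k_s} \sum_{u \in N(s)} \hat m_{us}$$

By permuting subscripts, we can also write Eq. (17) as

$$\hat m_{us} = \frac{k_s}{k_u} \hat c'_{us} + k_s \hat\psi'^T_{us} \big(\hat y_u - \hat\psi''_{us} \hat y_s\big)$$

Combining the last two equations, we get

$$\hat y_s = \underbrace{\hat x_s}_{\text{1st}} + \underbrace{\sum_{u \in N(s)} \frac{1}{k_u} \hat c'_{us}}_{\text{2nd}} + \underbrace{\sum_{u \in N(s)} \hat\psi'^T_{us} \hat y_u}_{\text{3rd}} - \underbrace{\sum_{u \in N(s)} \hat\psi'^T_{us} \hat\psi''_{us} \hat y_s}_{\text{4th}} \qquad (20)$$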

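By using our combined new vectors and matrices $\hat y$, $\hat x$, $\hat k$, and $\hat\psi'$ (and analogously for the column-recentered residual matrix
$\hat\psi''$), we can write Eq. (20) in matrix form as:

$$\hat y = \hat x + \hat\psi'^T \hat k + \hat\psi'^T \hat y - (\hat\psi'^T \hat\psi'')_* \hat y \qquad (21)$$

where the subscript * indicates the block-diagonal matrix whose block for node s is $\sum_{u \in N(s)} \hat\psi'^T_{us} \hat\psi''_{us}$ (the 4th term of Eq. (20)).
From Corollary 8, we know $\hat\psi''_{us} = \hat\psi'^T_{su}$. Therefore, from our construction we also have $\hat\psi'' = \hat\psi'^T$. We thus get

$$\hat y = \hat x + \hat\psi'^T \hat k + \hat\psi'^T \hat y - \hat\psi'^{T2}_* \hat y$$

Equation (21) is now a straight-forward linear equation system that can be solved for $\hat y$.

To finish the proof, first notice that each of our approximations becomes exact for $\lim_{\epsilon \to 0^+}$. Second, notice that higher order
deltas vanish and our equation simplifies to $\hat y = \hat x + \hat\psi'^T \hat k + \hat\psi'^T \hat y$. While the individual beliefs go to zero during this limit
consideration, their relative sizes stay the same, and thus the Maximum Marginal for each node stays the same. □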


Proposition 5: Update equations and convergence 

Proof Proposition 5. From the Jacobi method for solving linear systems (Saad 2003) - also known as the Neumann series
expansion of the matrix inverse - we know that the solution for $\hat y = (I - M)^{-1} \hat x$ can be calculated (under certain conditions)
via the iterative update equation

$$\hat y^{(r+1)} \leftarrow \hat x + M \hat y^{(r)}$$

These updates are known to converge for any choice of initial values for $\hat y^{(0)}$, as long as M has a spectral radius $\rho(M) < 1$.
The same convergence guarantees carry over to Eq. (5). We thus know that the update equation Eq. (5) converges if and only if
the spectral radius of the matrix $\hat\psi'^T - \hat\psi'^{T2}_*$ is smaller than 1. □
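The convergence argument can be illustrated with a generic matrix. The sketch below is ours and does not use an actual MRF: it draws a random matrix M, rescales it so that its spectral radius is 0.5 (an arbitrary choice below 1), and compares the iterative updates with the direct solution of the linear system. The function name `linear_fixed_point` is hypothetical.

```python
import numpy as np

def linear_fixed_point(M, x, iterations=200):
    """Iterate y <- x + M y, starting from y = 0 (Neumann series of (I - M)^-1 x)."""
    y = np.zeros_like(x)
    for _ in range(iterations):
        y = x + M @ y
    return y

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
M *= 0.5 / np.max(np.abs(np.linalg.eigvals(M)))    # rescale so that rho(M) = 0.5
x = rng.normal(size=6)

rho = np.max(np.abs(np.linalg.eigvals(M)))         # spectral radius
y_iter = linear_fixed_point(M, x)
y_closed = np.linalg.solve(np.eye(6) - M, x)       # closed-form solution

print(rho, np.max(np.abs(y_iter - y_closed)))
```

For large sparse systems one would typically estimate the spectral radius with an iterative eigenvalue solver rather than a dense eigendecomposition; the dense call here is only meant for small illustrative matrices.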


C Special formulations 

In this section, we derive alternative formulations of Theorem 4 for increasingly specialized cases: Graphs with node and 
edge types (Appendix C), graphs with equal number of classes for all nodes (Appendix C), graphs with one single directed 
edge potential (Appendix C), and the special case described in prior work of one single symmetric, doubly stochastic potential 
(Appendix C). 

Node types and edge types with repeated potentials 

In many realistic scenarios, the number of edges is usually larger than the number of different edge types (or edge potentials).
For example, assume a set Q of different node types. We then have a |Q|-partite network and each node with type $q \in Q$
can be one of k(q) classes. Further assume that the couplings along an edge only depend on the types at both ends of the
edge. Then there are at most $|Q|^2$ different row-recentered potentials irrespective of the size of the network, whereas the most
general formulation of Theorem 4 would redundantly store in $\hat\psi'$ one full potential for each edge (recall that we have one row-recentered
potential for every edge direction, thus two for every edge type). In the following, we transform the update equations
so that every different row-recentered edge potential appears only once in the equations. Notice that the ensuing formulation
allows for more than one potential between any pair of node types.

A complication in deriving a compact matrix formulation is that different types of nodes may have different numbers of
classes. We address this issue by creating separate matrices that contain the beliefs of nodes with the same number of classes.
Concretely, let $N = \{1, 2, \ldots, n\}$ be the set of all nodes, q(s) be the type of node s, k(q) be the number of classes for type q,
and $K = \{k(1), k(2), \ldots, k(|Q|)\}$ be the set of numbers of classes across all nodes. Let $N_k \subseteq N$ denote the set of nodes with
$k \in K$ classes so that all nodes are partitioned into groups $N_{k_1}, \ldots, N_{k_{|K|}}$. Let $n_k = |N_k|$ denote the number of nodes
with k classes. We assume a numbering of nodes such that $N_{k_1} = \{1, 2, \ldots, n_{k_1}\}$, $N_{k_2} = \{n_{k_1} + 1, n_{k_1} + 2, \ldots, n_{k_1} + n_{k_2}\}$,
and so on. Given this convention, each node s has a unique order o(s) within the set $N_{k_s}$. For each $k \in K$, we create two
$n_k \times k$ matrices $\hat Y_k$ and $\hat X_k$ that contain the posterior and prior residual beliefs of all nodes with k classes.

For each potential $\psi \in \mathbb{R}^{\ell \times k}$ we create two centered residual potentials $\hat\psi' \in \mathbb{R}^{\ell \times k}$ and $\hat\psi''^T = (\hat\psi^T)' \in \mathbb{R}^{k \times \ell}$ that
correspond to the two modulations across the two directions of an edge. For notational convenience, we treat them as two
distinct potentials and ignore their common ancestry. For example, $\psi_{12} \in \mathbb{R}^{k_1 \times k_2}$ leads to $\hat\psi'_{12} \in \mathbb{R}^{k_1 \times k_2}$ and $\hat\psi'_{21} = \hat\psi''^T_{12} \in \mathbb{R}^{k_2 \times k_1}$.

For each newly created row-recentered residual potential $\hat\psi' \in \mathbb{R}^{\ell \times k}$ we create two new matrices: (i) the adjacency matrix
$W_{\hat\psi'} \in \mathbb{R}^{n_\ell \times n_k}$ with $W_{\hat\psi'}(o(s), o(t)) = 1$ if node s with $\ell$ classes is connected to node t with k classes via an edge potential
$\hat\psi'$; and (ii) the diagonal in-degree matrix $D^{in}_{\hat\psi'} \in \mathbb{R}^{n_k \times n_k}$ with $D^{in}_{\hat\psi'}(o(t), o(t)) = d$ if there are d different nodes s that are
connected to t via an edge potential $\hat\psi'$ (notice that $\hat\psi'$ modulates along the direction s → t, therefore “in-degree” at node t
with k classes).

Footnote: Notice our use of vocabulary: the “type” of a node in a network is observed and known a priori (e.g., whether the node represents a user
or a product), whereas the “class” of a node is the label that we are trying to learn (e.g., whether the user is male or female).

Footnote: In a slight abuse of set notation, we allow here e.g. {1, 1, 2} to stand for a set {1, 2}.




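Proposition 14 (Edge types). Let $\mathcal{P}'$ be the set of all row-recentered potentials, $\mathcal{P}'_{\ell k} \subseteq \mathcal{P}'$ be the subset with dimensions
$\ell \times k$, and $\hat Y_k$, $\hat X_k$, $W_{\hat\psi'}$, $D^{in}_{\hat\psi'}$ be the above defined partitioned matrices for all $k \in K$. For each $\hat\psi' \in \mathcal{P}'$ let $\hat\psi''$ be the
corresponding column-recentered potential. The update equation Eq. (5) can then be written as follows:

$$\forall k \in K: \quad \hat Y_k \leftarrow \hat X_k + \hat C'_{k*} + \sum_{\ell \in K} \sum_{\hat\psi' \in \mathcal{P}'_{\ell k}} \Big(W^T_{\hat\psi'} \hat Y_\ell \hat\psi' - D^{in}_{\hat\psi'} \hat Y_k \hat\psi''^T \hat\psi'\Big) \qquad (22)$$

with

$$\hat C'_{k*} := \sum_{\ell \in K} \sum_{\hat\psi' \in \mathcal{P}'_{\ell k}} \frac{1}{\ell} W^T_{\hat\psi'} 1_{n_\ell} 1_\ell^T \hat\psi'$$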


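Proof Proposition 14. We are going to derive Eq. (22) from Eq. (20). For convenience, we repeat here both equations:

$$\hat Y_k = \hat X_k + \hat C'_{k*} + \sum_{\ell \in K} \sum_{\hat\psi' \in \mathcal{P}'_{\ell k}} \Big(W^T_{\hat\psi'} \hat Y_\ell \hat\psi' - D^{in}_{\hat\psi'} \hat Y_k \hat\psi''^T \hat\psi'\Big)$$

$$\hat y_s = \underbrace{\hat x_s}_{\text{1st}} + \underbrace{\sum_{u \in N(s)} \frac{1}{k_u} \hat c'_{us}}_{\text{2nd}} + \underbrace{\sum_{u \in N(s)} \hat\psi'^T_{us} \hat y_u}_{\text{3rd}} - \underbrace{\sum_{u \in N(s)} \hat\psi'^T_{us} \hat\psi''_{us} \hat y_s}_{\text{4th}}$$

We need to show that any vector $\hat y_s^T$ for a node s with k classes is equivalent to the o(s)-th row of $\hat Y_k$, for which we are going
to write $\hat Y_k(o(s), :)$ from now on (recall that o(s) is the order of node s within $N_k$). We show the equivalence for each of the 4
terms separately; all sums over $\hat\psi'$ below run over the same potentials as in Eq. (22):

1st term: $\hat x_s^T = \hat X_k(o(s), :)$ by construction.

2nd term: For the following, recall that $\hat c'_{us} = \hat\psi'^T_{us} 1_{k_u}$.

$$\Big(\sum_{u \in N(s)} \frac{1}{k_u} \hat c'_{us}\Big)^T = \sum_{u \in N(s)} \frac{1}{k_u} 1_{k_u}^T \hat\psi'_{us} = \sum_{\hat\psi'} \frac{1}{\ell} W^T_{\hat\psi'}(o(s), :)\, 1_{n_\ell} 1_\ell^T \hat\psi' = \hat C'_{k*}(o(s), :)$$

3rd term:

$$\Big(\sum_{u \in N(s)} \hat\psi'^T_{us} \hat y_u\Big)^T = \sum_{u \in N(s)} \hat y_u^T \hat\psi'_{us} = \sum_{u \in N(s)} W^T_{\hat\psi'_{us}}(o(s), o(u))\, \hat Y_{k_u}(o(u), :)\, \hat\psi'_{us}$$
$$= \sum_{\hat\psi'} W^T_{\hat\psi'}(o(s), :)\, \hat Y_\ell \hat\psi' = \Big(\sum_{\hat\psi'} W^T_{\hat\psi'} \hat Y_\ell \hat\psi'\Big)(o(s), :)$$

4th term:

$$\Big(\sum_{u \in N(s)} \hat\psi'^T_{us} \hat\psi''_{us} \hat y_s\Big)^T = \sum_{u \in N(s)} \hat y_s^T \hat\psi''^T_{us} \hat\psi'_{us} = \sum_{u \in N(s)} \hat Y_k(o(s), :)\, \hat\psi''^T_{us} \hat\psi'_{us}$$
$$= \sum_{\hat\psi'} D^{in}_{\hat\psi'}(o(s), o(s))\, \hat Y_k(o(s), :)\, \hat\psi''^T \hat\psi' = \Big(\sum_{\hat\psi'} D^{in}_{\hat\psi'} \hat Y_k \hat\psi''^T \hat\psi'\Big)(o(s), :)$$

□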

Nodes with same number of classes k 

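Proposition 14 simplifies considerably when all nodes have the same number of classes k:

Corollary 15 (Same k). Let k be the number of classes for each node in the graph, $\mathcal{P}'$ the set of row-recentered residual edge
potentials (all with k x k dimensions), $\hat Y$ and $\hat X$ the n x k dimensional final and explicit belief matrices, and $W_{\hat\psi'}$ and $D^{in}_{\hat\psi'}$
the adjacency and in-degree matrices for each potential $\hat\psi' \in \mathcal{P}'$. The update equations can then be simplified to:

$$\hat Y \leftarrow \hat X + \hat C'_* + \sum_{\hat\psi' \in \mathcal{P}'} \Big(W^T_{\hat\psi'} \hat Y \hat\psi' - D^{in}_{\hat\psi'} \hat Y \hat\psi''^T \hat\psi'\Big) \qquad (23)$$

with

$$\hat C'_* := \frac{1}{k} \sum_{\hat\psi' \in \mathcal{P}'} W^T_{\hat\psi'} 1_n 1_k^T \hat\psi'$$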


Also the convergence criterion and the closed-form solution allow very concise formulations. For this step we need to
introduce two new notations: Let $x_j$ denote the j-th column of matrix X (i.e., $X = \{x_{ij}\} = [x_1 \ldots x_n]$) and let X and Y be
matrices of order m x n and p x q, respectively. First, the vectorization of a matrix X stacks its columns one underneath the
other to form a single column vector:

$$\mathrm{vec}(X) = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$

Second, the Kronecker product ($\otimes$) of X and Y results in a mp x nq block matrix:

$$X \otimes Y = \begin{bmatrix} x_{11} Y & x_{12} Y & \ldots & x_{1n} Y \\ \vdots & \vdots & & \vdots \\ x_{m1} Y & x_{m2} Y & \ldots & x_{mn} Y \end{bmatrix}$$

With these notations, Corollary 6 now becomes

Proposition 16 (Convergence with same k). Update Eq. (23) converges if and only if the spectral radius $\rho(M) < 1$ for

$$M := \sum_{\hat\psi' \in \mathcal{P}'} \Big(W^T_{\hat\psi'} \otimes \hat\psi'^T - D^{in}_{\hat\psi'} \otimes (\hat\psi'^T \hat\psi'')\Big)$$

Furthermore, let $\hat y := \mathrm{vec}(\hat Y^T)$, $\hat x := \mathrm{vec}(\hat X^T)$, and $\hat c'_* := \mathrm{vec}(\hat C'^T_*)$. The closed-form solution of Eq. (23) is given by:

$$\hat y = (I_{nk} - M)^{-1} (\hat x + \hat c'_*) \qquad (24)$$

Notice that Eq. (24) is a special case of Eq. (4) that factors out repeated edge potentials. This concise factorization with the
Kronecker product is only possible if all nodes have the same number of classes k.


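Proof Proposition 16. If all nodes have the same number of classes k then all final and explicit beliefs form single n x k matrices
$\hat Y$ and $\hat X$. Furthermore, all potentials have k x k dimensions. Hence, Eq. (23) can be written as a single matrix equation:

$$\hat Y = \hat X + \hat C'_* + \sum_{\hat\psi' \in \mathcal{P}'} \Big(W^T_{\hat\psi'} \hat Y \hat\psi' - D^{in}_{\hat\psi'} \hat Y \hat\psi''^T \hat\psi'\Big)$$

$$\hat Y^T = \hat X^T + \hat C'^T_* + \sum_{\hat\psi' \in \mathcal{P}'} \Big(\hat\psi'^T \hat Y^T W_{\hat\psi'} - \hat\psi'^T \hat\psi'' \hat Y^T D^{in}_{\hat\psi'}\Big)$$

with $\hat C'^T_* := \frac{1}{k} \sum_{\hat\psi' \in \mathcal{P}'} \hat\psi'^T 1_k 1_n^T W_{\hat\psi'}$. We take the transpose in order for the later vectorization vec to create vectors where the
different beliefs of a node are adjacent (otherwise $\mathrm{vec}(\hat Y)$ results in a vector where all beliefs in the same class from different
nodes are adjacent). We next use Roth's column lemma (Henderson and Searle 1981; Roth 1934) that states that

$$\mathrm{vec}(XYZ) = (Z^T \otimes X)\, \mathrm{vec}(Y)$$

to rewrite this equation as

$$\hat y = \hat x + \hat c'_* + \sum_{\hat\psi' \in \mathcal{P}'} \Big(W^T_{\hat\psi'} \otimes \hat\psi'^T - D^{in}_{\hat\psi'} \otimes (\hat\psi'^T \hat\psi'')\Big)\, \hat y$$

for $\hat y = \mathrm{vec}(\hat Y^T)$, $\hat x = \mathrm{vec}(\hat X^T)$, and $\hat c'_* = \mathrm{vec}(\hat C'^T_*)$. Using the substitution

$$M := \sum_{\hat\psi' \in \mathcal{P}'} \Big(W^T_{\hat\psi'} \otimes \hat\psi'^T - D^{in}_{\hat\psi'} \otimes (\hat\psi'^T \hat\psi'')\Big)$$

and reforming the equation leads to the closed-form solution:

$$\hat y = (I_{nk} - M)^{-1} (\hat x + \hat c'_*)$$

which is defined if the spectral radius $\rho$ of M is smaller than 1. □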
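Roth's column lemma, which carries the step from the matrix equation to the vectorized equation, is easy to verify numerically. The following sketch is ours; the matrix shapes are arbitrary and the helper `vec` is a hypothetical name for column-wise stacking.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector (column-major order)."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 3))
Y = rng.normal(size=(3, 4))
Z = rng.normal(size=(4, 5))

lhs = vec(X @ Y @ Z)                       # vec(XYZ)
rhs = np.kron(Z.T, X) @ vec(Y)             # (Z^T kron X) vec(Y)

print(np.max(np.abs(lhs - rhs)))
```

Note that NumPy flattens in row-major order by default, so the column-major `order="F"` is needed for `vec` to match the definition used here.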


One single directed edge type 

Equation (23) simplifies further when we have just one single edge potential. In other words, we have a directed network and
assume only one single type of edge whose meaning changes across the two directions (e.g., who follows some type of person
on Twitter has a different meaning from who is followed by the same type).

Corollary 17 (One potential). Let k be the number of classes for each node in the graph, $\psi$ the k x k-dimensional potential
across an edge in the direction from source to target, $\hat Y$ and $\hat X$ the n x k dimensional final and explicit belief matrices, W
the directed adjacency matrix, and $D^{in}$ ($D^{out}$) the in-degree (out-degree) matrices. Then the update equations simplify to:

$$\hat Y \leftarrow \hat X + \hat C'_* + W^T \hat Y \hat\psi' + W \hat Y \hat\psi''^T - D^{in} \hat Y \hat\psi''^T \hat\psi' - D^{out} \hat Y \hat\psi' \hat\psi''^T \qquad (25)$$

with

$$\hat C'_* := \frac{1}{k} \Big(W^T 1_n 1_k^T \hat\psi' + W 1_n 1_k^T \hat\psi''^T\Big)$$

Corollary 18 (Convergence). Update Eq. (25) converges if and only if $\rho(M) < 1$ for

$$M = W^T \otimes \hat\psi'^T + W \otimes \hat\psi'' - D^{in} \otimes (\hat\psi'^T \hat\psi'') - D^{out} \otimes (\hat\psi'' \hat\psi'^T)$$


One symmetric, doubly stochastic potential 

Recent work (Gatterbauer et al. 2015) derived a linearization of BP for the special case of one single symmetric, doubly
stochastic edge potential that is used throughout the network (recall that for such a potential all residual row and column sums
are 0, and that by multiplying it by the number of classes it will be centered around 1). We can recover this special case from
Corollary 15 and Proposition 16 with a slightly updated notation.

Proposition 19 (One symmetric, doubly stochastic potential). If the MRF contains only one single edge type with a symmetric
doubly stochastic potential $\psi$, then the update equations simplify to:

$$\hat Y \leftarrow \hat X + W \hat Y \hat\psi' - D \hat Y \hat\psi'^2$$

At the same time, the closed form solution simplifies to:

$$\hat y = (I_{nk} - W \otimes \hat\psi' + D \otimes \hat\psi'^2)^{-1} \hat x \qquad (26)$$

Proof Proposition 19. First, notice that for any symmetric, doubly stochastic potential $\psi \in \mathbb{R}^{k \times k}$: $\hat\psi' = \hat\psi'' = \hat\psi / k$, and hence
$W^T_{\hat\psi'} \hat Y \hat\psi' + W_{\hat\psi'} \hat Y \hat\psi' = (W^T_{\hat\psi'} + W_{\hat\psi'}) \hat Y \hat\psi'$. Thus, its adjacency matrix becomes symmetric. Since we only have one potential, we also
have only one adjacency matrix W. Furthermore, $\hat\psi''^T = \hat\psi'$ and hence, $\hat\psi''^T \hat\psi' = \hat\psi'^2$.

Second, the constant term $\hat C'_*$ disappears for doubly stochastic potentials. This follows from the proof of Lemma 12 and the
fact that in any doubly stochastic matrix $\psi \in \mathbb{R}^{k \times k}$ each column is centered around $1/k$.

This allows us to simplify Eq. (23) to

$$\hat Y = \hat X + W \hat Y \hat\psi' - D \hat Y \hat\psi'^2 \qquad (27)$$

Similarly, applying above assumptions to our closed-form solution Eq. (24) leads to:

$$\hat y = (I_{nk} - W \otimes \hat\psi' + D \otimes \hat\psi'^2)^{-1} \hat x \qquad □$$

Notice that Eq. (27) and Eq. (26) are exactly the ones given by (Gatterbauer et al. 2015), except for slightly different notation.
In particular, the authors choose to center the potential $\psi$ around $1/k$, which is possible in the case that all nodes have the same
number of classes k (and thus all potentials are quadratic with k x k dimensions). We also chose here to formulate Eq. (26)
with $\hat y = \mathrm{vec}(\hat Y^T)$ instead of $\mathrm{vec}(\hat Y)$ to keep the beliefs of same nodes adjacent in the resulting stacked vectors. Vectorizing
$\hat Y$ instead of its transpose, we get the exact original formulation:

$$\mathrm{vec}(\hat Y) = (I_{nk} - \hat\psi' \otimes W + \hat\psi'^2 \otimes D)^{-1}\, \mathrm{vec}(\hat X) \qquad (28)$$

Notice that in a slight abuse of notation, we used $\hat\psi'$ in Theorem 4 for the sparse $k_{tot} \times k_{tot}$-square matrix, whereas we use
it here for the single k x k recentered residual potential.
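For this special case, the iterative updates and the closed-form solution can be compared directly. The sketch below is our own illustration under several assumptions: a small undirected 4-cycle, a made-up symmetric doubly stochastic potential with residual $\hat\psi' = \hat\psi / k$, and invented explicit beliefs for two labeled nodes; all variable names are hypothetical.

```python
import numpy as np

n, k = 4, 2
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)           # symmetric adjacency matrix (4-cycle)
D = np.diag(W.sum(axis=1))                          # degree matrix

# psi = [[1.1, 0.9], [0.9, 1.1]] is centered around 1; its recentered residual is psi_hat / k
H = np.array([[0.05, -0.05],
              [-0.05, 0.05]])

X = np.zeros((n, k))                                # explicit (prior) residual beliefs
X[0] = [0.1, -0.1]                                  # node 0 is labeled with class 1
X[1] = [-0.1, 0.1]                                  # node 1 is labeled with class 2

# Iterative updates: Y <- X + W Y H - D Y H^2
Y = np.zeros_like(X)
for _ in range(200):
    Y = X + W @ Y @ H - D @ Y @ H @ H

# Closed form on the column-stacked vectors
M = np.kron(H, W) - np.kron(H @ H, D)
rho = np.max(np.abs(np.linalg.eigvals(M)))
y_closed = np.linalg.solve(np.eye(n * k) - M, X.reshape(-1, order="F"))

labels = Y.argmax(axis=1)                           # maximum-marginal class per node
print(rho, labels)
```

Since the spectral radius of M is well below 1 for this weak potential, the iteration converges to the closed-form solution, and the final labels follow from the largest residual belief in each row.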

D Illustrating examples 

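Example 20 (Linearization). Consider the network Fig. 7a consisting of nodes N = {1, 2, 3}. Node 1 has three classes,
whereas nodes 2 and 3 have two classes. We have two edges, e.g., the edge between nodes 1 and 2 with a 3 x 2 potential $\psi_{12}$.
Fig. 7b illustrates Eq. (3). Notice that $\hat c'_* = \hat\psi'^T \hat k$ with $\hat k \in \mathbb{R}^{k_{tot}}$ and $k_{tot} = 3 + 2 + 2$. Further notice that
the matrix $\hat\psi'^{T2}_*$ is block-diagonal (entries represent the echo modulation that a node receives through all its neighbors). In
the following, we write $(\cdot)_2$ for the projection of a stacked vector on the entries for node 2, e.g., $(\hat y)_2 = \hat y_2$:

$$(\hat x)_2 = \hat x_2$$
$$(\hat c'_*)_2 = \tfrac{1}{3} \hat\psi'^T_{12} 1_3 + \tfrac{1}{2} \hat\psi'^T_{32} 1_2$$
$$(\hat\psi'^T \hat y)_2 = \hat\psi'^T_{12} \hat y_1 + \hat\psi'^T_{32} \hat y_3$$
$$(\hat\psi'^{T2}_* \hat y)_2 = \underbrace{\big(\hat\psi'^T_{12} \hat\psi'^T_{21} + \hat\psi'^T_{32} \hat\psi'^T_{23}\big)}_{\hat\psi_{2*}} \hat y_2$$

Then, the single update equation could also be written as several simultaneous update equations:

$$\hat y_1 \leftarrow \hat x_1 + (\hat c'_*)_1 + \hat\psi'^T_{21} \hat y_2 - \hat\psi_{1*} \hat y_1$$
$$\hat y_2 \leftarrow \hat x_2 + (\hat c'_*)_2 + \hat\psi'^T_{12} \hat y_1 + \hat\psi'^T_{32} \hat y_3 - \hat\psi_{2*} \hat y_2$$
$$\hat y_3 \leftarrow \hat x_3 + (\hat c'_*)_3 + \hat\psi'^T_{23} \hat y_2 - \hat\psi_{3*} \hat y_3$$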

Example 21 (Repeating potentials). We use Example 20 to also illustrate Proposition 14. Let ygj be the belief of node s in 
class j. We create two belief matrices Wk € K = {2, 3}.' Y 2 = [yll yll], and Y 3 = [t/n V12 vn]. Thus, for example, the 

* “Row-recentering” and “column-recentering” as proposed in the present paper are more general forms of the centerings proposed in
earlier work, which is necessary in order to deal with the general case of non-square and non-doubly stochastic potentials.
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Figure 7: Example 20: (a): Network with 3 nodes and 2 edge potentials, (b): Resulting equation system where x contains any prior beliefs 
and c* is a vector that depends only on the graph and potentials. 




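beliefs of node 2 are in row 1 of $\mathbf{Y}_2$ (also written as $o(2) = 1$). We have four row-recentered matrices: $\hat{\psi}'_{12}$, $\hat{\psi}'_{21}$, $\hat{\psi}'_{23}$, $\hat{\psi}'_{32}$,
with corresponding echo cancellation potentials (e.g., $\hat{\psi}'_{21}\hat{\psi}'_{12}$) and appropriate adjacency and in-degree matrices.
For example, $\hat{\psi}'_{12} \in \mathbb{R}^{3 \times 2}$ has $\mathbf{W}_{\psi_{12}} = \begin{bmatrix} 1 & 0 \end{bmatrix}$ where the first entry indicates an edge from node 1 to node 2. We illustrate next
in detail:

$$\mathbf{W}_{\psi_{12}} = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad \mathbf{W}_{\psi_{21}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{W}_{\psi_{23}} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad \mathbf{W}_{\psi_{32}} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}$$

$$\mathbf{D}_{\psi_{12}} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad \mathbf{D}_{\psi_{21}} = \begin{bmatrix} 1 \end{bmatrix}, \quad \mathbf{D}_{\psi_{23}} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{D}_{\psi_{32}} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

We then get the following update equations:

$$\mathbf{Y}_2 \leftarrow \mathbf{X}_2 + \mathbf{C}'_{2*}
+ \begin{bmatrix} 1 \\ 0 \end{bmatrix} \mathbf{Y}_3 \hat{\psi}'_{12} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \mathbf{Y}_2 \hat{\psi}'_{21}\hat{\psi}'_{12}
+ \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \mathbf{Y}_2 \hat{\psi}'_{23} - \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \mathbf{Y}_2 \hat{\psi}'_{32}\hat{\psi}'_{23}
+ \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \mathbf{Y}_2 \hat{\psi}'_{32} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \mathbf{Y}_2 \hat{\psi}'_{23}\hat{\psi}'_{32}$$

$$\mathbf{Y}_3 \leftarrow \mathbf{X}_3 + \mathbf{C}'_{3*} + \begin{bmatrix} 1 & 0 \end{bmatrix} \mathbf{Y}_2 \hat{\psi}'_{21} - \begin{bmatrix} 1 \end{bmatrix} \mathbf{Y}_3 \hat{\psi}'_{12}\hat{\psi}'_{21}$$

with

$$\mathbf{C}'_{2*} = \frac{1}{3} \begin{bmatrix} 1 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} \hat{\psi}'_{12}
+ \frac{1}{2} \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \hat{\psi}'_{23}
+ \frac{1}{2} \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \hat{\psi}'_{32}$$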


E Weighted edges in MRFs 

The notion of “weights” on edges in MRFs is not defined and it is not immediately clear what an appropriate semantics would
be. Here we give a natural interpretation of edge weights in MRFs and derive a modification of the linearization to handle
such weighted edges. We derive this interpretation starting from one single axiom:

Axiom 22 (Edge weights). An edge with weight $w \in \mathbb{N}$ behaves identically to $w$ parallel edges with the same potential but
weight 1.

From the original BP update equations Eq. (1) and Eq. (2), we see that two parallel edges carry the same messages, and that
these two messages need to be multiplied to calculate the resulting messages and beliefs. It follows that these parallel edges
behave identically to having one single edge with a new potential $\psi \odot \psi$, i.e., the result of element-wise multiplying the
entries of the original potential. More generally, an edge with a potential $\psi$ and weight $w$ is the same as an unweighted edge
with a new potential $\psi_w$ with entries $\psi_w(j, i) = \psi(j, i)^w$.

To see how weights affect our linearized formulation in terms of residuals, recall that $\psi(j, i) = 1 + \hat{\psi}(j, i)$. Therefore,
$\psi_w(j, i) = (1 + \hat{\psi}(j, i))^w = 1 + w\hat{\psi}(j, i) + O(\hat{\psi}(j, i)^2)$. Under the assumption of small deviations from the center, we thus
get: $\hat{\psi}_w(j, i) \approx w\hat{\psi}(j, i)$. Hence, weights on edges simply multiply the residual potentials in our linearized formulation. In other
words, weights on edges simply scale the coupling strengths between two neighbors.
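The two claims above (parallel edges act like an entry-wise power, and the residual of that power is approximately $w$ times the original residual) can be illustrated numerically. This is a minimal sketch with hypothetical residual values; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical small residuals; the potential is centered around 1.
psi_hat = np.array([[0.02, -0.01, -0.01],
                    [-0.03, 0.01, 0.02]])
psi = 1.0 + psi_hat
w = 3                                       # integer edge weight

psi_parallel = psi * psi * psi              # w parallel edges: entry-wise product
psi_weighted = psi ** w                     # one edge of weight w: entry-wise power

residual_exact = psi_weighted - 1.0         # exact residual of the weighted edge
residual_linear = w * psi_hat               # first-order approximation
max_error = np.abs(residual_exact - residual_linear).max()
```

The remaining error is of second order in the residuals, so it shrinks quickly as the potential moves closer to its center.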

It follows that Proposition 14 can be generalized to weighted networks by using weighted adjacency matrices $\mathbf{W}_{\psi}$ with
elements $\mathbf{W}_{\psi}(o(s), o(t)) = w > 0$ if the edge $s \to t$ with potential $\psi$ and weight $w$ exists, and $\mathbf{W}_{\psi}(o(s), o(t)) = 0$
otherwise. In addition, each entry $(o(t), o(t))$ in the block-diagonal matrix $\mathbf{D}_{\psi}$ is now the sum of the squared weights of
edges to neighbors that are connected to $t$ via edge potential $\psi$, instead of just the number of neighbors (recall that the echo
cancellation goes back and forth, and notice again that the potentials work along the direction $s \to t$). After this modification,
Proposition 14 can immediately be used for weighted graphs as well.

Figure 8: Example 24: Illustrating example for BP and its linearization “Lin”. Details are given in the text.
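The modified matrices can be sketched in a few lines. The sketch below assumes that all edges share one potential and that a node's index equals its row position $o(\cdot)$; the function name `weighted_matrices` and the toy edge list are hypothetical.

```python
import numpy as np

def weighted_matrices(n, edges):
    """Build W and D from directed edges (s, t, w), meaning s -> t with weight w > 0."""
    W = np.zeros((n, n))
    for s, t, w in edges:
        W[s, t] = w
    # The echo travels along an edge and back again, hence squared weights.
    # Entry (t, t) sums over all edges s -> t that arrive at node t.
    D = np.diag((W ** 2).sum(axis=0))
    return W, D

# Toy graph: 0 - 1 with weight 2 and 1 - 2 with weight 3, both directions listed.
edges = [(0, 1, 2.0), (1, 0, 2.0), (1, 2, 3.0), (2, 1, 3.0)]
W, D = weighted_matrices(3, edges)
```

With all weights equal to 1, `D` reduces to the plain count of neighbors, which matches the unweighted case described above.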


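Example 23 (Edge weights). We give here a small detailed example that shows the effects of weights for a potential whose
entries are not really close to each other (i.e., the average entry is 1, however entries can diverge considerably from 1). We
start with the potential $\psi = \begin{bmatrix} 4 & 6 & 5 \\ 6 & 8 & 7 \end{bmatrix}$. By dividing all entries by 6, we get an equivalent potential that is centered around 1; and
from this we get the residual and the row-recentered residual matrices:

$$\psi = \frac{1}{6}\begin{bmatrix} 4 & 6 & 5 \\ 6 & 8 & 7 \end{bmatrix}, \quad
\hat{\psi} = \frac{1}{6}\begin{bmatrix} -2 & 0 & -1 \\ 0 & 2 & 1 \end{bmatrix}, \quad
\hat{\psi}' = \frac{1}{18}\begin{bmatrix} 1 & -1 & 0 \\ 1 & -1 & 0 \end{bmatrix}$$

The squared potential centered around 1 is then: $\psi_2 \propto \begin{bmatrix} 4^2 & 6^2 & 5^2 \\ 6^2 & 8^2 & 7^2 \end{bmatrix}$. And the residual and row-recentered residual matrices:

$$\hat{\psi}_2 \approx \begin{bmatrix} 0.575 & 0.044 & 0.336 \\ 0.044 & -0.699 & -0.300 \end{bmatrix}, \quad
\hat{\psi}'_2 \approx \begin{bmatrix} 0.085 & -0.091 & 0.006 \\ 0.121 & -0.127 & 0.006 \end{bmatrix}$$

We can now compare the potential we get by multiplying the residual by 2, or by squaring the original potential and then
recentering:

$$2\hat{\psi}' \approx \begin{bmatrix} 0.111 & -0.111 & 0 \\ 0.111 & -0.111 & 0 \end{bmatrix}, \quad
\hat{\psi}'_2 \approx \begin{bmatrix} 0.085 & -0.091 & 0.006 \\ 0.121 & -0.127 & 0.006 \end{bmatrix}$$

We see that the overall direction is correct, but there are considerable differences (e.g., $\approx$ 30% relative difference for the first
matrix entry: 0.111 vs. 0.085).

We next bring each entry in the potential closer to the center. Concretely, we reduce the deviation by one order of magnitude:

$$\psi = \frac{1}{6}\begin{bmatrix} 5.8 & 6.0 & 5.9 \\ 6.0 & 6.2 & 6.1 \end{bmatrix}, \quad
\hat{\psi} = \frac{1}{60}\begin{bmatrix} -2 & 0 & -1 \\ 0 & 2 & 1 \end{bmatrix}, \quad
\hat{\psi}' = \frac{1}{180}\begin{bmatrix} 1 & -1 & 0 \\ 1 & -1 & 0 \end{bmatrix}$$

Now both versions are very close (e.g., $\approx$ 2% relative difference for the first matrix entry: 0.0111 vs. 0.0109):

$$2\hat{\psi}' \approx \begin{bmatrix} 0.0111 & -0.0111 & 0 \\ 0.0111 & -0.0111 & 0 \end{bmatrix}, \quad
\hat{\psi}'_2 \approx \begin{bmatrix} 0.0109 & -0.0110 & 0.00005 \\ 0.0113 & -0.0113 & 0.00005 \end{bmatrix}$$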
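The relative differences quoted in this example can be recomputed with a short script. This is a minimal sketch under a stated assumption: row-recentering is taken to mean subtracting each row's mean from the residual and dividing by the number of columns $k$. The paper defines row-recentering elsewhere, so the helper names and the overall sign convention below are assumptions; only magnitudes are compared.

```python
import numpy as np

def center(psi):
    """Scale a potential so that its average entry is 1."""
    return psi / psi.mean()

def row_recentered_residual(psi_centered):
    """Assumed definition: (residual minus its row mean) divided by k."""
    residual = psi_centered - 1.0
    k = psi_centered.shape[1]
    return (residual - residual.mean(axis=1, keepdims=True)) / k

def first_entry_gap(psi, w=2):
    """Relative gap between w * recentered residual and the recentered residual of psi**w."""
    scaled = w * row_recentered_residual(center(psi))
    exact = row_recentered_residual(center(psi ** w))
    return abs(scaled[0, 0] - exact[0, 0]) / abs(exact[0, 0])

gap_far = first_entry_gap(np.array([[4.0, 6.0, 5.0], [6.0, 8.0, 7.0]]))
gap_near = first_entry_gap(np.array([[5.8, 6.0, 5.9], [6.0, 6.2, 6.1]]))
```

Under this assumption, `gap_far` comes out near 30% and `gap_near` near 2%, in line with the figures quoted in the example.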


F Illustrating examples 


Example 24 (Convergence). We illustrate the different convergence behaviors of BP and our formalism in a graph with 
several different potentials. The scenario is a variant of an example given by (Heskes 2002) of a 4-clique graph with weights 
on the edges (see Fig. 8a). The weights are used to entry-wise exponentiate the entries of the potential rp = [f\] before 
normalization: . For example, a weight 2 leads to(^ 2 ) — [ \^ lel’ cind a weight—3 leads to'\\) = 


A 1 


. The graph has 2 nodes A and B with prior beliefs: $\mathbf{x}_A = \mathbf{x}_B = [0.8, 0.2]^{\top}$.


Figure 8b shows the beliefs for nodes A and C for various iterations of BP. The dashed lines show the maximum marginal 
(MM) distribution as determined by complete iteration. We see that BP has a somewhat erratic cyclic behavior and does not 
converge. Figure 8c shows a variant that uses damping (Koller and Friedman 2009, ch. 11.1), a method that is often used to
make BP converge when it otherwise would not. Damping with 0.1 (meaning an updated message is a linear combination of
90% of the prior message and 10% of newly calculated values) is able to dampen the behavior, but convergence happens
only after 1,000s of iterations, and even then the maximum marginal for node A is wrong (0.48 leads to choosing class 2, vs. 
0.51 leads to choosing class 1). Furthermore, after replacing rp in our example with [ f | ], BP will not converge anymore for 
any damping factor and the fixed points of BP are unstable. 
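The damping rule described above (90% of the previous message, 10% of the newly computed one) can be written as a one-line update. This is a minimal sketch; the function name `damped_update` and the example messages are hypothetical, and the renormalization step is an assumption that has no effect when both inputs are already normalized.

```python
import numpy as np

def damped_update(old_message, new_message, damping=0.1):
    """Mix the previous message with the newly computed one, then renormalize."""
    mixed = (1.0 - damping) * old_message + damping * new_message
    return mixed / mixed.sum()

old = np.array([0.5, 0.5])     # message from the previous iteration
new = np.array([0.9, 0.1])     # message computed in the current iteration
msg = damped_update(old, new)  # moves only 10% of the way toward `new`
```

A small damping factor slows every step down, which is consistent with the thousands of iterations reported for Figure 8c.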
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Figure 9: Example 25: (a) Example network, (b) BP and the linearization lead to different predictions for the colored parameter choices. 


Figure 8d illustrates the convergence of the linearization: The convergence boundary is e* = 0.3109. By choosing a conver¬ 
gence parameter s = ^ = ^> the values converge after a few iterations. In addition, the MM beliefs coincide with the exact 
solution. 


Example 25 (Errors of linearization). We know that the approximation becomes worse if values are not close to the centering
point. We give here a simple example to illustrate the impact this has on the actual node classification. Consider the
network in Fig. 9a with n = 4 nodes and each node having k = 2 classes. Let $\mathbf{x}_A = \mathbf{x}_B = [x, \bar{x}]^{\top}$ and $\mathbf{x}_C = [y, \bar{y}]^{\top}$,
where $\bar{x} := 1 - x$. We will calculate $\mathbf{y}_D = [z, \bar{z}]^{\top}$ as a function of $x$ and $y$ (representing the prior beliefs at the nodes in the
respective classification) and then assign class 1 to D if $z > \bar{z}$, or equally $z > 0.5$.

For BP, the following condition needs to hold for D to be labeled as class 1:

$$x^2 y > \bar{x}^2 \bar{y}$$

In contrast, for the linearization, the following condition needs to hold (notice that we formulated the condition on the
residuals):

$$2\hat{x} + \hat{y} > 0$$

Both conditions are equivalent close to the centering point 0.5, however diverge further away from that point. Figure 9b
illustrates 4 regimes for parameters $x$ and $y$: In the large gray area (upper right) both algorithms give identical predictions
and assign class 1 (analogously for the white area in the lower left). But in the blue area (lower right), BP assigns class 2
whereas its linearization assigns class 1 (analogously for the yellow area in the upper left). Now consider the red cross marking the point
(x = 0.7, y = 0.14) in the blue regime. BP calculates z = 0.47 whereas the approximation calculates z = 0.54.
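The two decision rules can be compared directly at the marked point. This is a minimal sketch, assuming that D multiplies the evidence arriving from A, B, and C under BP and adds the corresponding residuals under the linearization. These closed forms are assumptions chosen because they match the two conditions and the reported values; the function names are hypothetical.

```python
def bp_belief(x, y):
    """Belief of D in class 1 under BP: normalized product of the incoming evidence."""
    class1 = x * x * y
    class2 = (1 - x) * (1 - x) * (1 - y)
    return class1 / (class1 + class2)

def linearized_belief(x, y):
    """Belief of D in class 1 under the linearization: the residuals add up."""
    x_hat, y_hat = x - 0.5, y - 0.5
    return 0.5 + 2 * x_hat + y_hat

z_bp = bp_belief(0.7, 0.14)            # below 0.5, so BP picks class 2
z_lin = linearized_belief(0.7, 0.14)   # above 0.5, so the linearization picks class 1
```

Close to $x = y = 0.5$ both functions cross the 0.5 threshold together, which is the regime where the two conditions agree.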


G Existing graph generators and hardness of node labeling 

There is a large body of work that proposes various realistic synthetic generative graph models. However, almost all of this 
work assumes unlabeled graphs. While one could use these existing graph generators to have realistic graphs, one cannot easily 
take a graph and then label the nodes according to some desired compatibility matrix. In fact, this problem is NP-hard. 

Proposition 26 (Labeling with potentials). Given a graph $G(V, E)$. Finding labels $\ell : v \in V \to [k]$ so that the labels follow
a given stochastic affinity matrix $\psi$ (where $\psi(i, j)$ determines the average fraction of nodes of class $j$ connected to a node of
class $i$) is NP-hard.


Proof of Proposition 26. Membership in NP follows from the fact that we can easily verify a solution by calculating the average
neighbor-to-neighbor relations in a labeled graph. We use a simple reduction from the problem of Graph 3-colorability. Graph
3-colorability is the question of whether there exists a labeling function $\kappa : V \to \{1, 2, 3\}$ such that $\kappa(u) \neq \kappa(v)$ whenever
$(u, v) \in E$ for a given graph $G = (V, E)$ and is well known to be NP-hard (Stockmeyer 1973). Assume now that we have
a method that allows us to label any graph $G(V, E)$ following the heterophily matrix

$$\psi = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},$$

i.e., neighboring nodes never have the same label. It follows immediately that such a solution would also be a solution to graph 3-colorability.

□
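The verification step used in the proof (calculating the average neighbor-to-neighbor relations in a labeled graph) takes time linear in the number of edges. The following is a minimal sketch; the function name `neighbor_class_fractions` and the triangle example are hypothetical, and classes are numbered from 0 for convenience.

```python
import numpy as np

def neighbor_class_fractions(edges, labels, k):
    """Entry (i, j): fraction of the neighbors of class-i nodes that have class j."""
    counts = np.zeros((k, k))
    for u, v in edges:                       # undirected edge, counted in both directions
        counts[labels[u], labels[v]] += 1
        counts[labels[v], labels[u]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)

# A triangle with a proper 3-coloring.
edges = [(0, 1), (1, 2), (0, 2)]
labels = [0, 1, 2]
affinity = neighbor_class_fractions(edges, labels, 3)
```

A zero diagonal in the result means that no edge connects two nodes of the same class, which is exactly the 3-coloring property used in the reduction.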


We thus need graph generators which generate both the graph topology (i.e., W) and the node labels (i.e., X) in the same
process. We know of only two papers that have proposed graph generators that generate labeled data in the process: the early
work by (Sen and Getoor 2007) and the very recent work by (Lee et al. 2015). Neither graph generator is available. In addition,
neither of the papers gives a way to know the “ground truth” actual potential matrix that was used to label data (e.g., (Lee et al.
2015) suggests this as future work).

We therefore had to implement our own synthetic graph generator with the additional design decision that any potential
matrix can be “planted” as an exact graph property. This allows us to separate two concerns: (1) how does our method
work on graphs with certain properties, and (2) what is the variation in properties of a given generative model. By planting exact
properties (instead of expected properties) we can focus on question (1) only. The random graph generator is described in detail
in (Gatterbauer 2016).
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